
Let's consider a list of large integers, for example one given by:

import numpy as np

def primesfrom2to(n):
    # http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
    """ Input n >= 6, returns an array of primes, 2 <= p < n """
    # sieve only over numbers of the form 6k+-1 (the 2,3-wheel), hence the /3 indexing
    sieve = np.ones(n/3 + (n%6==2), dtype=np.bool)
    sieve[0] = False
    for i in xrange(int(n**0.5)/3+1):
        if sieve[i]:
            k = 3*i+1 | 1
            sieve[k*k/3::2*k] = False
            sieve[(k*k+4*k-2*k*(i&1))/3::2*k] = False
    return np.r_[2, 3, ((3*np.nonzero(sieve)[0]+1) | 1)]

primesfrom2to(2000000)

I want to calculate the sum of those primes, and the expected result is 142913828922. But if I do:

sum(primesfrom2to(2000000))

I get 1179908154, which is clearly wrong. The problem is that I have an int overflow, but I don't understand why. Let me explain. Consider this testing code:

a = primesfrom2to(2000000)
b = [float(i) for i in a]   # the same values as Python floats
c = [long(i) for i in a]    # the same values as Python longs
sumI = 0   # running sum with numpy integers
sumF = 0   # running sum with Python floats
sumL = 0   # running sum with Python longs
m = 0
for i, j, k in zip(a, b, c):
    m = m+1
    sumI = sumI+i
    sumF = sumF+j
    sumL = sumL+k
    print sumI, sumF, sumL
    if sumI < 0:   # a negative partial sum means the integer sum wrapped around
        print i, m
        break
I found out that the first integer overflow happens at m = 20044, that is at a[20043] = 225289.

If I do:

>>> sum(a[:20043])+225289
-2147310677

But if I do:

>>> sum(a[:20043])
2147431330
>>> 2147431330+225289
2147656619L

What's happening? Why such a different behaviour? Why can't sum switch automatically to long type and give the correct result?

Pierpaolo

1 Answer


Look at the types of your results. You are summing a numpy array, which uses fixed-size numpy datatypes, and those can overflow. When you do sum(a[:20043]), you get a numpy scalar back (an np.int32 on your 32-bit build), which wraps around when you add another number to it. When you type the same number in manually, you create a Python builtin int, which can auto-promote to long. Numpy arrays cannot auto-promote like Python builtin types, because the array's type (and its memory layout) has to be fixed when the array is created. This makes operations much faster at the expense of type flexibility.
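
To see the difference directly, here is a minimal sketch (the values come from the question; the explicit np.int32 array is my construction so the wraparound reproduces regardless of platform, and numpy may emit a RuntimeWarning on the scalar overflow):

>>> import numpy as np
>>> s = sum(np.array([2147431330], dtype=np.int32))  # builtin sum over a numpy array returns a numpy scalar
>>> type(s)
<type 'numpy.int32'>
>>> s + 225289            # numpy scalar: wraps around at 2**31
-2147310677
>>> p = 2147431330        # the same value as a Python builtin int
>>> type(p)
<type 'int'>
>>> p + 225289            # auto-promotes past 2**31 (long on 32-bit Python 2)
2147656619L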

You may be able to get around the problem by converting the result array to a wider datatype (like np.int64) before summing; note that it is the dtype of the array the function returns, not the np.bool sieve, that matters here. However, it depends how big your numbers are. A simple example:

# Python types ok
>>> 2**62
4611686018427387904L
>>> 2**63
9223372036854775808L

# numpy types overflow
>>> np.int64(2)**62
4611686018427387904
>>> np.int64(2)**63
-9223372036854775808
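
Applied to the question's array, a sketch of the workaround (assuming numpy is imported as np and primesfrom2to is defined as in the question; the expected total 142913828922 is the one the question states):

a = primesfrom2to(2000000)

# Widen the result array to 64 bits before summing; numpy sums stay in the array's dtype.
print a.astype(np.int64).sum()    # 142913828922

# Or hand the elements off to Python longs, which auto-promote.
print sum(long(x) for x in a)     # 142913828922 (a Python long on 32-bit)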

Your example works correctly for me on 64-bit Python, so I'm guessing you're running 32-bit Python. If you can use 64-bit types you will get past the limit you found, but as my example shows, you will eventually overflow 64-bit ints too if your numbers get big enough.
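
If you are not sure which build you have, a quick sketch for checking (sys.maxsize and np.intp are the relevant indicators; the last line assumes primesfrom2to from the question):

import sys
import numpy as np

print sys.maxsize > 2**32          # True on 64-bit Python, False on 32-bit
print np.dtype(np.intp).itemsize   # 8 on 64-bit builds, 4 on 32-bit

# The primes array inherits numpy's platform integer, which is why
# the overflow shows up on 32-bit but not on 64-bit:
print primesfrom2to(100).dtype     # int64 on 64-bit, int32 on 32-bit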

BrenBarn
  • I never thought that np.int was different from python int. Good to know :-) – Pierpaolo May 28 '14 at 23:42
  • This is correct, and just to add to why there is no auto-promotion: it's (mostly) because of the great speed increase you get by using fixed data sizes, which processors are optimized for. numpy likes to be fast. – Victory May 28 '14 at 23:42
  • @BrenBarn, is this system dependent somehow as I get the correct answer running the OP's code? – Padraic Cunningham May 28 '14 at 23:53
  • @PadraicCunningham: Yes, it will depend on whether you're running 64-bit or 32-bit Python. – BrenBarn May 28 '14 at 23:56