
I am working with very high-dimensional vectors for machine learning and was thinking about using numpy to reduce the amount of memory used. I ran a quick test to see how much memory I could save using numpy (1)(3):

Standard list

import random
random.seed(0)
vector = [random.random() for i in xrange(2**27)]

Numpy array

import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() for i in xrange(2**27)), dtype=float)

Memory usage (2)

Numpy array: 1054 MB
Standard list: 2594 MB

Just like I expected.

By allocating a contiguous block of memory filled with native floats, numpy only consumes about half of the memory the standard list is using.
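
A quick way to sanity-check where the saving comes from (my own sketch, not part of the original test; it uses a smaller vector of 2**20 elements so it runs fast):

import sys
import random
import numpy

random.seed(0)
n = 2 ** 20  # smaller than the original 2**27 so the check runs quickly

vector_list = [random.random() for i in range(n)]
vector_array = numpy.array(vector_list, dtype=float)

# The numpy array keeps n native doubles in one contiguous buffer.
print(vector_array.nbytes)            # n * 8 bytes

# The list keeps n pointers plus n separate Python float objects.
print(sys.getsizeof(vector_list))     # the pointer table (plus over-allocation)
print(sys.getsizeof(vector_list[0]))  # overhead of a single Python float object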

Because I know my data is pretty sparse, I did the same test with sparse data.

Standard list

import random
random.seed(0)
vector = [random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)]

Numpy array

import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)), dtype=float)

Memory usage (2)

Numpy array: 1054 MB
Standard list: 529 MB

Now, all of a sudden, the Python list uses half the amount of memory the numpy array uses! Why?

One thing I could think of is that Python dynamically switches to a dict representation when it detects that the list contains very sparse data. Checking for this could add a lot of extra run-time overhead, though, so I don't really think that this is what is going on.

Notes

  1. I started a fresh new python shell for every test.
  2. Memory measured with htop.
  3. Run on 32bit Debian.
zeebonk
  • Is this on a 32 bit or a 64 bit system? I suspect that the pointers to 0 in the Python list are smaller than the numpy floats. – PM 2Ring Apr 14 '15 at 12:43
  • The `0` literal is an integer, so your list is built mostly from ints, whereas in the numpy array you have all floats. What is more, small integers (-5..255) are interned, so all those zeros in the list point to the same object. Try using `0.0` and see if there is a difference. – m.wasowski Apr 14 '15 at 12:43
  • @PM2Ring: Tests are run on 32bits Debian. Updated the question accordingly. – zeebonk Apr 14 '15 at 13:05
  • @m.wasowski Changed the literal int zero to a float. Same results. Updated the question accordingly. – zeebonk Apr 14 '15 at 13:06

1 Answer


A Python list is just an array of references (pointers) to Python objects. In CPython (the usual Python implementation) a list gets slightly over-allocated to make expansion more efficient, but it never gets converted to a dict. See the source code for further details: List object implementation
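
A small illustration of that over-allocation (a sketch of my own, not from the question's code): the size reported for a list jumps in steps as it grows, because CPython reserves room for more pointers than it currently needs.

import sys

lst = []
last = sys.getsizeof(lst)
print(0, last)
for i in range(32):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        # Capacity grows in chunks, so the reported size only changes occasionally.
        print(len(lst), size)
        last = size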

In the sparse version of the list, you have a lot of pointers to a single shared zero object (with your updated code, the `0.0` constant from the comprehension). Those pointers take up 32 bits = 4 bytes on your 32-bit build, but your numpy floats are certainly larger, probably 64 bits = 8 bytes.
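
A rough check of this explanation (again a sketch on a smaller vector; the 4-byte pointer figure assumes your 32-bit build, on 64-bit it would be 8 bytes):

import sys
import random
import numpy

random.seed(0)
n = 2 ** 20

sparse_list = [random.random() if random.random() < 0.00001 else 0.0 for i in range(n)]
sparse_array = numpy.array(sparse_list, dtype=float)

# Every zero entry in the list is a reference to one shared 0.0 object,
# so each costs only one pointer plus a single float object overall.
print(len({id(x) for x in sparse_list if x == 0.0}))  # -> 1 in CPython

# The list stores n pointers; the array stores n native 64-bit floats.
print(sys.getsizeof(sparse_list))  # roughly n * pointer_size, plus bookkeeping
print(sparse_array.nbytes)         # exactly n * 8 bytes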

FWIW, to make the sparse list / array tests more accurate you should call random.seed(some_const) with the same seed in both versions so that you get the same number of zeroes in both the Python list and the numpy array.

PM 2Ring
  • Numpy indeed defaults to 64-bit floats; forcing them to 32 bits resulted in an array of comparable size. Thanks! – zeebonk Apr 17 '15 at 12:51