
I am working with very high-dimensional vectors for machine learning and was thinking about using numpy to reduce the amount of memory used. I ran a quick test to see how much memory I could save using numpy (1)(3):

Standard list

import random
random.seed(0)
vector = [random.random() for i in xrange(2**27)]

Numpy array

import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() for i in xrange(2**27)), dtype=float)

Memory usage (2)

Numpy array: 1054 MB
Standard list: 2594 MB

Just like I expected.

By allocating a contiguous block of memory filled with native floats, numpy only consumes about half of the memory the standard list is using.
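
A quick way to sanity-check where the saving comes from (my own sketch, not part of the original test; it uses a smaller vector of 2**20 elements so it runs fast):

import sys
import random
import numpy

random.seed(0)
n = 2 ** 20  # smaller than the original 2**27 so the check runs quickly

vector_list = [random.random() for i in range(n)]
vector_array = numpy.array(vector_list, dtype=float)

# The numpy array keeps n native doubles in one contiguous buffer.
print(vector_array.nbytes)            # n * 8 bytes

# The list keeps n pointers plus n separate Python float objects.
print(sys.getsizeof(vector_list))     # the pointer table (plus over-allocation)
print(sys.getsizeof(vector_list[0]))  # overhead of a single Python float object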

Because I know my data is pretty sparse, I did the same test with sparse data.

Standard list

import random
random.seed(0)
vector = [random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)]

Numpy array

import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)), dtype=float)

Memory usage (2)

Numpy array: 1054 MB
Standard list: 529 MB

Now, all of a sudden, the Python list uses half the amount of memory the numpy array uses! Why?

One thing I could think of is that Python dynamically switches to a dict representation when it detects that the list contains very sparse data. Checking for this could add a lot of extra run-time overhead, though, so I don't really think that this is what is going on.

Notes

  1. I started a fresh new python shell for every test.
  2. Memory measured with htop.
  3. Run on 32bit Debian.
zeebonk
  • Is this on a 32 bit or a 64 bit system? I suspect that the pointers to 0 in the Python list are smaller than the numpy floats. – PM 2Ring Apr 14 '15 at 12:43
  • The `0` literal is an integer, so your list is built mostly from ints, whereas in the numpy array you have all floats. What is more, small integers (-5..255) are interned, so all those zeros in the list point to the same object. Try using `0.0` and see if there is a difference. – m.wasowski Apr 14 '15 at 12:43
  • @PM2Ring: Tests are run on 32bits Debian. Updated the question accordingly. – zeebonk Apr 14 '15 at 13:05
  • @m.wasowski Changed the literal int zero to a float. Same results. Updated the question accordingly. – zeebonk Apr 14 '15 at 13:06

1 Answer


A Python list is just an array of references (pointers) to Python objects. In CPython (the usual Python implementation) a list gets slightly over-allocated to make expansion more efficient, but it never gets converted to a dict. See the source code for further details: List object implementation
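
A small illustration of that over-allocation (a sketch of my own, not from the question's code): the size reported for a list jumps in steps as it grows, because CPython reserves room for more pointers than it currently needs.

import sys

lst = []
last = sys.getsizeof(lst)
print(0, last)
for i in range(32):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        # Capacity grows in chunks, so the reported size only changes occasionally.
        print(len(lst), size)
        last = size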

In the sparse version of the list, you have a lot of pointers to a single shared zero object (with your updated code, the `0.0` constant from the comprehension). Those pointers take up 32 bits = 4 bytes on your 32-bit build, but your numpy floats are certainly larger, probably 64 bits = 8 bytes.
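
A rough check of this explanation (again a sketch on a smaller vector; the 4-byte pointer figure assumes your 32-bit build, on 64-bit it would be 8 bytes):

import sys
import random
import numpy

random.seed(0)
n = 2 ** 20

sparse_list = [random.random() if random.random() < 0.00001 else 0.0 for i in range(n)]
sparse_array = numpy.array(sparse_list, dtype=float)

# Every zero entry in the list is a reference to one shared 0.0 object,
# so each costs only one pointer plus a single float object overall.
print(len({id(x) for x in sparse_list if x == 0.0}))  # -> 1 in CPython

# The list stores n pointers; the array stores n native 64-bit floats.
print(sys.getsizeof(sparse_list))  # roughly n * pointer_size, plus bookkeeping
print(sparse_array.nbytes)         # exactly n * 8 bytes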

FWIW, to make the sparse list / array tests more accurate you should call random.seed(some_const) with the same seed in both versions so that you get the same number of zeroes in both the Python list and the numpy array.

PM 2Ring
  • Numpy indeed defaults to 64-bit floats; forcing them to 32 bits resulted in an array of comparable size. Thanks! – zeebonk Apr 17 '15 at 12:51