I am working with very high-dimensional vectors for machine learning and was thinking about using numpy to reduce the amount of memory used. I ran a quick test to see how much memory I could save using numpy (1)(3):
Standard list
import random
random.seed(0)
vector = [random.random() for i in xrange(2**27)]
Numpy array
import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() for i in xrange(2**27)), dtype=float)
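As an aside: since the length is known up front here, numpy.fromiter also accepts an optional count argument, which lets it allocate the whole buffer once instead of growing it while the generator is consumed. A sketch of the same construction with preallocation:

import numpy
import random
random.seed(0)
# count=2**27 lets fromiter allocate every float64 slot up front
vector = numpy.fromiter((random.random() for i in xrange(2**27)),
                        dtype=float, count=2**27)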
Memory usage (2)
Numpy array: 1054 MB
Standard list: 2594 MB
Just as I expected. By allocating one contiguous block of memory for native floats, numpy consumes only about half the memory of the standard list.
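The numbers roughly add up, too. A back-of-envelope check, assuming 32-bit CPython 2 (a 16-byte float object plus a 4-byte list slot per element, versus 8 bytes per float64 in the numpy buffer):

n = 2 ** 27
print(n * 8 / 2.0 ** 20)         # numpy buffer: 1024 MB
print(n * (16 + 4) / 2.0 ** 20)  # list: ~2560 MB of float objects + pointers

That is close to the 1054 MB and 2594 MB measured above; the rest is interpreter baseline and the list's over-allocation.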
Because I know my data is pretty sparse, I ran the same test with sparse data.
Standard list
import random
random.seed(0)
vector = [random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)]
Numpy array
import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)), dtype=float)
Memory usage (2)
Numpy array: 1054 MB
Standard list: 529 MB
Now, all of a sudden, the Python list uses half the memory of the numpy array! Why?
One explanation I could think of is that Python dynamically switches to a dict representation when it detects that a list contains very sparse data. But checking for that would add a lot of extra run-time overhead, so I don't really think that this is what is going on.
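As a small-scale sanity check of that idea (using nothing beyond type() and id() on a shorter vector), I can at least look at what the list object is and how many distinct objects it references:

import random
random.seed(0)
small = [random.random() if random.random() < 0.00001 else 0.0
         for i in xrange(10 ** 6)]
print(type(small))                     # still a plain list, no dict switch
print(len(set(id(x) for x in small))) # number of distinct objects referenced

If the second number comes out far smaller than the list length, the list is mostly repeated references to the same few objects rather than a million separate floats.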
Notes

1. I started a fresh Python shell for every test.
2. Memory measured with htop.
3. Run on 32-bit Debian.