
I am trying to implement algorithms for 1000-dimensional data with 200k+ datapoints in Python. I want to use numpy, scipy, sklearn, networkx and other useful libraries. I want to perform operations such as pairwise distance between all of the points and do clustering on all of the points. I have implemented working algorithms that perform what I want with reasonable complexity, but when I try to scale them to all of my data I run out of RAM. Of course I do: creating the matrix for pairwise distances on 200k+ data points takes a lot of memory.

Here comes the catch: I would really like to do this on crappy computers with low amounts of RAM.

Is there a feasible way for me to make this work without the constraint of low RAM? That it will take much longer is really not a problem, as long as the time requirements don't go to infinity!

I would like to be able to put my algorithms to work, then come back an hour or five later and not find them stuck because they ran out of RAM! I would like to implement this in Python and be able to use the numpy, scipy, sklearn and networkx libraries. I would like to be able to calculate the pairwise distances to all my points, etc.

Is this feasible? And how would I go about it? What can I start to read up on?

Best regards // Mesmer

  • I want to be able to perform, for example, pairwise distance between all points in a 200,000 x 1000 matrix in Python without having enough RAM to keep the whole distance matrix in memory. I am looking for information on how to do that :) so more concrete answers than a vague "look into two whole subfields of computer science" would be helpful! – Ekgren Apr 22 '13 at 15:26
  • You probably want to take a look at numpy's [memmap](http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html) and possibly [PyTables](http://www.pytables.org) as a starting point. – Henry Gomersall Apr 22 '13 at 15:42
  • From the first related question below the user @cronos suggested to [use `h5py`](http://www.h5py.org/docs/intro/quick.html#quick), and I believe it can be used for your problem too. 1-[Is it possible to np.concatenate memory-mapped files?](http://stackoverflow.com/questions/13780907/is-it-possible-to-np-concatenate-memory-mapped-files) 2-[Concatenate Numpy arrays without copying](http://stackoverflow.com/questions/7869095/concatenate-numpy-arrays-without-copying) – Saullo G. P. Castro May 02 '13 at 17:03
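
A minimal sketch of the h5py suggestion from the comments above (the file name, dataset name 'X', and chunk shape are my assumptions, not from the question): an HDF5 dataset lives on disk, and only the chunks you actually touch are pulled into RAM.

import h5py
import numpy

# create an on-disk dataset; only accessed chunks are loaded into RAM
with h5py.File('data.h5', 'w') as f:
    dset = f.create_dataset('X', shape=(200000, 1000), dtype='float32',
                            chunks=(1000, 1000))
    # fill it block by block instead of building the full array in memory
    for i in range(0, 200000, 1000):
        dset[i:i+1000] = numpy.random.rand(1000, 1000).astype('float32')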

1 Answer


Using numpy.memmap you can create arrays directly mapped into a file:

import numpy
a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000,1000))
# here you will see a 762MB file created in your working directory    

You can treat it as a conventional array: a += 1000.

It is even possible to assign more arrays to the same file, controlling it from mutually independent sources if needed. But I've experienced some tricky things here. To open the full array again you have to "close" the previous one first, using del:

del a    
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(200000,1000))

But opening only part of the array makes it possible to control both simultaneously:

b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000))
b[1,5] = 123456.
print(a[1,5])
#123456.0

Great! a was changed together with b, and the changes are already written on disk (the operating system flushes them; you can force a write with b.flush()).

The other important thing worth mentioning is the offset. Suppose you want to take not the first 2 rows of b, but rows 150000 and 150001.

b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000),
                 offset=150000*1000*32//8)  # offset must be an integer number of bytes
b[1,2] = 999999.
print(a[150001,2])
#999999.0

Now you can access and update any part of the array with simultaneous operations. Note the byte size going into the offset calculation: 'float32' has 32/8 = 4 bytes per element, so for 'float64' this example would be offset=150000*1000*64//8.
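
To tie this back to the original question, here is a sketch of filling a full pairwise-distance matrix block by block, so that only small blocks ever sit in RAM. The block size, the output file name, and the Euclidean metric are my assumptions; note that the resulting 200000 x 200000 float32 matrix occupies roughly 150 GB, so the constraint moves from RAM to disk space:

import numpy
from scipy.spatial.distance import cdist

n, d, block = 200000, 1000, 1000  # block size is a tunable assumption
X = numpy.memmap('test.mymemmap', dtype='float32', mode='r', shape=(n, d))
D = numpy.memmap('dist.mymemmap', dtype='float32', mode='w+', shape=(n, n))

for i in range(0, n, block):
    for j in range(0, n, block):
        # only two small blocks of X and one block of D are in RAM here
        D[i:i+block, j:j+block] = cdist(X[i:i+block], X[j:j+block])
D.flush()

If you only need a reduction over each chunk (nearest neighbours, for example) rather than the full matrix on disk, scikit-learn's pairwise_distances_chunked generator takes a similar streaming approach.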

  • Sorry, I didn't understand what you've done. You created a file called 'test.mymemmap' using 'w+' and mapped it into memory by assigning the variable a. But then you deleted it, and then read the file using 'r+' and stored it in the variable b. I'm not sure what you've done. I have a large file called 'myfile.npy', which I want to read in batches... – Marlon Teixeira Aug 25 '20 at 14:41
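
Regarding the last comment: a .npy file can be memory-mapped rather than loaded, by passing mmap_mode to numpy.load, and then read in batches. A minimal sketch (the file name follows the comment; the batch size is an arbitrary assumption):

import numpy

# mmap_mode='r' returns a memmap view instead of reading the whole file
arr = numpy.load('myfile.npy', mmap_mode='r')

batch = 10000  # tunable assumption
for start in range(0, arr.shape[0], batch):
    chunk = numpy.asarray(arr[start:start + batch])  # only this slice is read into RAM
    # ... process chunk ...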