19

I am trying to calculate the pairwise cosine similarities of 100,000 vectors, each of which has 200,000 dimensions.

From reading other questions I know that memmap, PyTables and h5py are my best bets for handling this kind of data, and I am currently working with two memmaps: one for reading the vectors, the other for storing the matrix of cosine similarities.

Here is my code:

import numpy as np
import scipy.spatial.distance as dist

xdim = 200000
ydim = 100000

wmat = np.memmap('inputfile', dtype = 'd', mode = 'r', shape = (xdim,ydim))
dmat = np.memmap('outputfile', dtype = 'd', mode = 'readwrite', shape = (ydim,ydim))

for i in np.arange(ydim):
    for j in np.arange(i+1,ydim):
        dmat[i,j] = dist.cosine(wmat[:,i],wmat[:,j])
        dmat.flush()

Currently, htop reports that I am using 224G of VIRT memory and 91.2G of RES memory, which is climbing steadily. It seems to me as if, by the end of the process, the entire output matrix will be stored in memory, which is something I'm trying to avoid.

QUESTION: Is this a correct usage of memmaps? Am I writing to the output file in a memory-efficient manner, by which I mean that only the necessary parts of the input and output files, i.e. dmat[i,j] and wmat[:,i]/wmat[:,j], are stored in memory?

If not, what did I do wrong, and how can I fix this?

Thanks for any advice you may have!

EDIT: I just realized that htop is reporting total system memory usage at 12G, so it seems it is working after all... can anyone out there enlighten me? RES is now at 111G...

EDIT2: The memmap is created from a 1D array of double-precision values, most of them very close to 0, which is then reshaped to the desired dimensions. The memmap then looks like this:

memmap([[  9.83721223e-03,   4.42584107e-02,   9.85033578e-03, ...,
          -2.30691545e-07,  -1.65070799e-07,   5.99395837e-08],
        [  2.96711345e-04,  -3.84307391e-04,   4.92968462e-07, ...,
          -3.41317722e-08,   1.27959347e-09,   4.46846438e-08],
        [  1.64766260e-03,  -1.47337747e-05,   7.43660202e-07, ...,
           7.50395136e-08,  -2.51943163e-09,   1.25393555e-07],
        ...,
        [ -1.88709000e-04,  -4.29454722e-06,   2.39720287e-08, ...,
          -1.53058717e-08,   4.48678211e-03,   2.48127260e-07],
        [ -3.34207882e-04,  -4.60275148e-05,   3.36992876e-07, ...,
          -2.30274532e-07,   2.51437794e-09,   1.25837564e-01],
        [  9.24923862e-04,  -1.59552854e-03,   2.68354822e-07, ...,
          -1.08862665e-05,   1.71283316e-07,   5.66851420e-01]])
Jojanzing
  • I wouldn't say this question is "wrong" for SO but you will probably get a better answer at http://codereview.stackexchange.com as this is more about architecture than an actual bug or how-to. – Victory Aug 23 '15 at 01:20
  • 4
    @Victory CR is more about *code* than architecture. Not saying it is "wrong" for CR, but I think OP will probably get a better answer at SO. :) – Simon Forsberg Aug 23 '15 at 01:23
  • well, basically I'm asking how to read/write large files from/to disk efficiently. I'm confused because I am getting conflicting information from htop =S – Jojanzing Aug 23 '15 at 01:33
  • If you need to spread out over multiple machines, and you want to run in python, look at [apache spark](http://spark.apache.org). – Paul Aug 23 '15 at 01:34
  • @Paul thanks for the tip =) currently I only have access to a single machine. – Jojanzing Aug 23 '15 at 01:38
  • 1
    can you add a sample of your input file? Also why are you writing to a memmap if you don't want to store in memory, why not write straight to disk? – Padraic Cunningham Aug 23 '15 at 10:18
  • @Padraic Cunningham, I want to store my data in a 2D array structure, and I'm not sure how to write an array elementwise straight to disk, do you have any links/examples? I will add a sample of my input file. – Jojanzing Aug 23 '15 at 10:44
  • @Jojanzing, do you want the data stored as a 2d array on disk and are you actually going to be reading from `dmat` also? – Padraic Cunningham Aug 23 '15 at 11:10
  • @PadraicCunningham, I guess I could just store it as a 1D array and read it as a 2D array, like I did with the input file. I will be reading from `dmat` to determine pairs of similar items somewhere down the line, yes (but it does not necessarily have to occur while I'm calculating `dmat`). – Jojanzing Aug 23 '15 at 11:30
  • @PadraicCunningham is it possible to write a 1D array element-by-element straight to disk? I guess by adding another few bytes to the file each time a cosine similarity is calculated... – Jojanzing Aug 23 '15 at 12:37

2 Answers

8

In terms of memory usage, there's nothing particularly wrong with what you're doing at the moment. Memmapped arrays are handled at the level of the OS - data to be written is usually held in a temporary buffer, and only committed to disk when the OS deems it necessary. Your OS should never allow you to run out of physical memory before flushing the write buffer.

I'd advise against calling flush on every iteration, since this defeats the purpose of letting your OS decide when to write to disk in order to maximise efficiency. At the moment you're writing only a single float value at a time.
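
For illustration, the loop from the question could simply drop the per-element flush (a sketch reusing the question's variable names; the single flush at the end is optional, since pending changes are written back by the OS in any case when the memmap is closed):

for i in np.arange(ydim):
    for j in np.arange(i + 1, ydim):
        dmat[i, j] = dist.cosine(wmat[:, i], wmat[:, j])
    # no flush here - let the OS schedule the writes itself
dmat.flush()  # at most one explicit flush, after all the work is done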


In terms of IO and CPU efficiency, operating on a single pair of columns at a time is almost certainly suboptimal. Reads and writes are generally quicker for large, contiguous blocks of data, and likewise your calculation will probably be much faster if you can process many columns at once using vectorization. The general rule of thumb is to process as big a chunk of your array as will fit in memory (including any intermediate arrays that are created during your computation).

Here's an example showing how much you can speed up operations on memmapped arrays by processing them in appropriately-sized chunks.
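
To make that concrete, here's a rough sketch of what a blocked version of the computation in the question could look like. The block size is a made-up number to tune to your RAM, it assumes no column of wmat is all zeros, and unlike the original loop it also fills the diagonal blocks, which is harmless but slightly redundant:

import numpy as np

# reuses wmat, dmat and ydim exactly as defined in the question
block = 1000  # hypothetical block size - tune so two blocks of vectors fit in RAM

for i0 in range(0, ydim, block):
    i1 = min(i0 + block, ydim)
    a = np.array(wmat[:, i0:i1])      # load one block of column vectors into RAM
    a /= np.linalg.norm(a, axis=0)    # normalise each column once
    for j0 in range(i0, ydim, block):
        j1 = min(j0 + block, ydim)
        b = np.array(wmat[:, j0:j1])
        b /= np.linalg.norm(b, axis=0)
        # one block of cosine *distances* at a time
        # (1 - similarity, matching scipy.spatial.distance.cosine)
        dmat[i0:i1, j0:j1] = 1.0 - np.dot(a.T, b)

dmat.flush()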

Another thing that can make a huge difference is the memory layout of your input and output arrays. By default, np.memmap gives you a C-contiguous (row-major) array. Accessing wmat by column will therefore be very inefficient, since you're addressing non-adjacent locations on disk. You would be much better off if wmat was F-contiguous (column-major) on disk, or if you were accessing it by row.
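
If rewriting the input once is an option, a one-off conversion along these lines would do it (a sketch only: the new file name 'inputfile_f' and the row-block size are placeholders, and it temporarily doubles the disk space taken by the input):

import numpy as np

# reuses wmat, xdim and ydim from the question
wmat_f = np.memmap('inputfile_f', dtype='d', mode='w+',
                   shape=(xdim, ydim), order='F')

rows = 1000  # hypothetical row-block size - tune to available RAM
for r0 in range(0, xdim, rows):
    r1 = min(r0 + rows, xdim)
    wmat_f[r0:r1, :] = wmat[r0:r1, :]   # copy a band of rows at a time
wmat_f.flush()

# wmat_f[:, i] now maps to one contiguous run of bytes on disk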

The same general advice applies to using HDF5 instead of memmaps, although bear in mind that with HDF5 you will have to handle all the memory management yourself.
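
For instance, with h5py the output array could be set up roughly like this (a minimal sketch; the file name, dataset name, chunk shape and compression are all things you would have to choose yourself, which is part of the extra bookkeeping mentioned above):

import h5py

# reuses ydim from the question
with h5py.File('similarities.h5', 'w') as f:
    dmat_h5 = f.create_dataset('dmat', shape=(ydim, ydim), dtype='d',
                               chunks=(1, 8192), compression='gzip')
    # unwritten chunks take no space on disk; write whole blocks of results,
    # e.g. from the blocked loop above:
    #     dmat_h5[i0:i1, j0:j1] = 1.0 - np.dot(a.T, b)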

ali_m
  • To get an F-contiguous, column-major array, would it suffice to create the memmap with `order = 'F'`? Thanks for the detailed description. The code in the link looks great too, I will give that a try. – Jojanzing Aug 23 '15 at 12:50
  • 1
    That wouldn't help in your example, since `wmat` is a pre-existing array on disk that you're opening in read-only mode. You would have to actually write `wmat` to disk in column-major format to begin with. – ali_m Aug 23 '15 at 12:56
  • ah I see... I'll keep that in mind in future. One last question, are there any compelling reasons to use HDF5 over memmaps? – Jojanzing Aug 23 '15 at 13:36
  • 2
    Speed, compression, portability... Joe Kington's answer [here](http://stackoverflow.com/a/27713489/1461210) does a pretty good job of covering the pros and cons. – ali_m Aug 23 '15 at 13:39
7

Memory maps are exactly what the name says: mappings of (virtual) disk sectors into memory pages. The memory is managed by the operating system on demand. If there is enough memory, the system keeps parts of the files in memory, possibly filling up the whole memory; if there is not enough left, it may discard pages read from the file or swap them out to swap space. Normally you can rely on the OS to be as efficient as possible.
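
You can watch this behaviour in a toy example (a rough sketch; it assumes psutil is available, purely to report the resident set size, and uses a small scratch file so it runs quickly):

import os
import numpy as np
import psutil  # assumption: psutil is installed, used only to read the RSS

rss_mb = lambda: psutil.Process(os.getpid()).memory_info().rss // 2**20

# map an ~800 MB scratch file: mapping alone costs almost no physical memory
a = np.memmap('scratch.dat', dtype='d', mode='w+', shape=(10000, 10000))
print('after mapping: %d MB resident' % rss_mb())

# touch a corner of it: the OS pages in only what you access,
# and may drop those pages again later if memory gets tight
a[:1000, :1000].sum()
print('after touching a block: %d MB resident' % rss_mb())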

Daniel
  • I see, it's using as much memory as it can, as efficiently as possible. Thanks for clearing that up! Do you have any idea why the system usage is only 12G even though VIRT is at 224G and RES now stable at 149G? – Jojanzing Aug 23 '15 at 12:28