I am trying to calculate the pairwise cosine similarities of 100,000 vectors, each of which has 200,000 dimensions.
From reading other questions, I know that memmap, PyTables and h5py are my best bets for handling this kind of data. I am currently working with two memmaps: one for reading the vectors, the other for storing the matrix of cosine similarities.
Here is my code:
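import numpy as np
import scipy.spatial.distance as dist

xdim = 200000  # dimensions per vector
ydim = 100000  # number of vectors

# Each column of wmat is one vector
wmat = np.memmap('inputfile', dtype='d', mode='r', shape=(xdim, ydim))
dmat = np.memmap('outputfile', dtype='d', mode='readwrite', shape=(ydim, ydim))

for i in np.arange(ydim):
    for j in np.arange(i + 1, ydim):
        # Note: dist.cosine returns the cosine distance, i.e. 1 - cosine similarity
        dmat[i, j] = dist.cosine(wmat[:, i], wmat[:, j])
        dmat.flush()  # write pending changes to disk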
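For anyone who wants to reproduce this, here is a scaled-down sketch of the same loop that runs in a few seconds. The sizes, the random test data and the temporary file paths are made up for illustration and are not my real data. It also flushes once per row, which is a choice made for this sketch only.

```python
import os
import tempfile

import numpy as np
import scipy.spatial.distance as dist

# Hypothetical small sizes so the example finishes quickly.
xdim = 50  # dimensions per vector
ydim = 8   # number of vectors

tmpdir = tempfile.mkdtemp()
in_path = os.path.join(tmpdir, "inputfile")
out_path = os.path.join(tmpdir, "outputfile")

# Random stand-in data; each column is one vector, as in the layout above.
rng = np.random.default_rng(0)
data = rng.standard_normal((xdim, ydim))
data.tofile(in_path)  # raw float64 values, no header

wmat = np.memmap(in_path, dtype="d", mode="r", shape=(xdim, ydim))
# mode="w+" creates the output file; "readwrite" needs a file that already exists.
dmat = np.memmap(out_path, dtype="d", mode="w+", shape=(ydim, ydim))

for i in range(ydim):
    for j in range(i + 1, ydim):
        # dist.cosine is the cosine distance (1 - cosine similarity).
        dmat[i, j] = dist.cosine(wmat[:, i], wmat[:, j])
    dmat.flush()  # flush once per finished row

# Cross-check the upper triangle against SciPy's pairwise implementation.
reference = dist.squareform(dist.pdist(data.T, "cosine"))
print(np.allclose(np.triu(dmat, 1), np.triu(reference, 1)))
```

On data this small everything fits in RAM, so this only checks that the loop and the memmap shapes are right; it says nothing about memory behaviour at full size.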
Currently, htop reports that I am using 224G of VIRT memory and 91.2G of RES memory, and RES is climbing steadily. It seems to me that, by the end of the process, the entire output matrix will be held in memory, which is what I'm trying to avoid.
QUESTION:
Is this a correct usage of memmaps? Am I writing to the output file in a memory-efficient manner (by which I mean that only the necessary parts of the input and output files, i.e. dmat[i,j] and wmat[:,i] / wmat[:,j], are held in memory)?
If not, what did I do wrong, and how can I fix this?
Thanks for any advice you may have!
EDIT: I just realized that htop is reporting total system memory usage at 12G, so it seems it is working after all... Can anyone out there enlighten me? RES is now at 111G...
EDIT2: The memmap is created from a 1D array consisting of lots and lots of long decimals quite close to 0, which is reshaped to the desired dimensions. The memmap then looks like this:
memmap([[ 9.83721223e-03, 4.42584107e-02, 9.85033578e-03, ...,
         -2.30691545e-07, -1.65070799e-07, 5.99395837e-08],
       [ 2.96711345e-04, -3.84307391e-04, 4.92968462e-07, ...,
         -3.41317722e-08, 1.27959347e-09, 4.46846438e-08],
       [ 1.64766260e-03, -1.47337747e-05, 7.43660202e-07, ...,
         7.50395136e-08, -2.51943163e-09, 1.25393555e-07],
       ...,
       [ -1.88709000e-04, -4.29454722e-06, 2.39720287e-08, ...,
         -1.53058717e-08, 4.48678211e-03, 2.48127260e-07],
       [ -3.34207882e-04, -4.60275148e-05, 3.36992876e-07, ...,
         -2.30274532e-07, 2.51437794e-09, 1.25837564e-01],
       [ 9.24923862e-04, -1.59552854e-03, 2.68354822e-07, ...,
         -1.08862665e-05, 1.71283316e-07, 5.66851420e-01]])
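To make the EDIT2 description concrete, here is a minimal sketch of how a flat 1D array ends up as a 2D memmap, and what that implies for where a column lives in the file. The tiny sizes, the random values and the temporary path are placeholders, and I am assuming the default C (row-major) order, since I pass no `order` argument to `np.memmap`.

```python
import os
import tempfile

import numpy as np

# Hypothetical small sizes standing in for 200000 x 100000.
xdim, ydim = 6, 4

# A flat 1D float64 array of small values, standing in for the real data.
flat = np.random.default_rng(1).standard_normal(xdim * ydim) * 1e-3

path = os.path.join(tempfile.mkdtemp(), "inputfile")
flat.tofile(path)  # raw bytes only, no header

# Reinterpret the same bytes as an (xdim, ydim) matrix without loading them.
wmat = np.memmap(path, dtype="d", mode="r", shape=(xdim, ydim))

# In C order, a row is contiguous in the file, while a column
# consists of every ydim-th value.
row_is_contiguous = np.array_equal(wmat[2], flat[2 * ydim:3 * ydim])
col_is_strided = np.array_equal(wmat[:, 1], flat[1::ydim])
print(row_is_contiguous, col_is_strided)
print(wmat.strides)  # (bytes between rows, bytes between columns)
```

If that assumption holds for my real file, then with ydim = 100000 two consecutive elements of a column wmat[:, i] are 800,000 bytes apart, so reading a single column touches a different page of the file for every element. That may be related to the memory figures I am seeing, but I am not sure.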