10

I am trying to load a sparse array that I previously saved. Saving the sparse array was easy enough, but reading it back is a pain: scipy.load returns a 0-d array wrapped around my sparse array.

import scipy as sp
A = sp.load("my_array"); A
array(<325729x325729 sparse matrix of type '<type 'numpy.int8'>'
with 1497134 stored elements in Compressed Sparse Row format>, dtype=object)

In order to get a sparse matrix I have to flatten the 0d array, or use sp.asarray(A). This seems like a really hard way to do things. Is Scipy smart enough to understand that it has loaded a sparse array? Is there a better way to load a sparse array?

Rob Cowie
iform

3 Answers

15

The mmwrite/mmread functions in scipy.io can save/load sparse matrices in the Matrix Market format.

scipy.io.mmwrite('/tmp/my_array',x)
scipy.io.mmread('/tmp/my_array').tolil()    

mmwrite and mmread may be all you need. They are well-tested and use a well-known format.

However, the following might be a bit faster:

We can save the row and column coordinates and the data as 1-d arrays in npz format.

import random
import scipy.sparse as sparse
import scipy.io
import numpy as np

def save_sparse_matrix(filename, x):
    # Save the COO coordinates, values and shape as plain 1-d arrays
    x_coo = x.tocoo()
    row = x_coo.row
    col = x_coo.col
    data = x_coo.data
    shape = x_coo.shape
    np.savez(filename, row=row, col=col, data=data, shape=shape)

def load_sparse_matrix(filename):
    # Rebuild the COO matrix from the saved arrays
    y = np.load(filename)
    z = sparse.coo_matrix((y['data'], (y['row'], y['col'])), shape=y['shape'])
    return z

N = 20000
x = sparse.lil_matrix((N, N))
for i in range(N):
    x[random.randint(0, N-1), random.randint(0, N-1)] = random.randint(1, 100)

save_sparse_matrix('/tmp/my_array',x)
load_sparse_matrix('/tmp/my_array.npz').tolil()

Here is some code which suggests saving the sparse matrix in an npz file may be quicker than using mmwrite/mmread:

def using_np_savez():    
    save_sparse_matrix('/tmp/my_array',x)
    return load_sparse_matrix('/tmp/my_array.npz').tolil()

def using_mm():
    scipy.io.mmwrite('/tmp/my_array',x)
    return scipy.io.mmread('/tmp/my_array').tolil()    

if __name__=='__main__':
    for func in (using_np_savez,using_mm):
        y=func()
        print(repr(y))
        assert(x.shape==y.shape)
        assert(x.dtype==y.dtype)
        assert(x.__class__==y.__class__)    
        assert(np.allclose(x.todense(),y.todense()))

yields

% python -mtimeit -s'import test' 'test.using_mm()'
10 loops, best of 3: 380 msec per loop

% python -mtimeit -s'import test' 'test.using_np_savez()'
10 loops, best of 3: 116 msec per loop
unutbu
    +1, `scipy.io` is the proper solution. I would add that if you want to go down the optimization road, you might consider `numpy.load(mmap_mode='r'/'c')`. Memory-mapping the files from disk gives instant load **and** can save memory, as the same memory-mapped array can be shared across multiple processes. – Radim Jul 19 '11 at 21:07
  • scipy.io.savemat is probably the best – mathtick Mar 27 '13 at 15:11
  • Using np_savez instead of mm decreased my loading time of a big sparse matrix from 8min47 to 3s ! Thanks ! I also tried savez_compressed but the size is the same and the loading time much longer. – MatthieuBizien Mar 01 '14 at 02:38
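The memory-mapping tip from the first comment above can be sketched as follows (the filename is illustrative). Note that mmap_mode only applies to plain dense arrays stored directly in .npy files; the pickled object-array wrapping that np.save produces for sparse matrices cannot be memory-mapped:

```python
import numpy as np

# Save a small dense array, then memory-map it back from disk.
a = np.arange(12, dtype=np.int8).reshape(3, 4)
np.save('/tmp/dense_demo.npy', a)

# mmap_mode='r' maps the file read-only instead of reading it into memory;
# pages are loaded lazily and can be shared across processes.
m = np.load('/tmp/dense_demo.npy', mmap_mode='r')
print(m[1, 2])  # elements are read from disk on demand
```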
6

One can extract the object hidden away in the 0d array using () as index:

A = sp.load("my_array")[()]

This looks weird, but it seems to work anyway, and it is a very short workaround.
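A minimal round trip showing the workaround (here the 0-d object wrapper is built explicitly, which is effectively what np.save does with a non-array object; on newer NumPy versions, allow_pickle=True is required to load object arrays):

```python
import numpy as np
import scipy.sparse as sparse

x = sparse.csr_matrix(np.eye(3, dtype=np.int8))

# np.save wraps non-array objects in a 0-d object array; build one explicitly
wrapped = np.empty((), dtype=object)
wrapped[()] = x
np.save('/tmp/my_array.npy', wrapped)

# Load it back and unwrap the sparse matrix with the empty-tuple index
A = np.load('/tmp/my_array.npy', allow_pickle=True)[()]
print(A.format)  # 'csr' -- the original sparse matrix is back
```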

user4713166
1

For all the upvotes on the mmwrite answer, I'm surprised no one has tried to answer the actual question. But since the question has been reactivated, I'll give it a try.

This reproduces the OP case:

In [90]: x=sparse.csr_matrix(np.arange(10).reshape(2,5))
In [91]: np.save('save_sparse.npy',x)
In [92]: X=np.load('save_sparse.npy')
In [95]: X
Out[95]: 
array(<2x5 sparse matrix of type '<type 'numpy.int32'>'
    with 9 stored elements in Compressed Sparse Row format>, dtype=object)
In [96]: X[()].A
Out[96]: 
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [94]: x
Out[94]: 
<2x5 sparse matrix of type '<type 'numpy.int32'>'
    with 9 stored elements in Compressed Sparse Row format

The `[()]` that user4713166 gave us is not a 'hard way' to extract the sparse array.

np.save and np.load are designed to operate on ndarrays. But a sparse matrix is not such an array, nor is it a subclass (as np.matrix is). It appears that np.save wraps the non-array object in an object dtype array, and saves it along with a pickled form of the object.

When I try to save a different kind of object, one that can't be pickled, I get an error message at:

403  # We contain Python objects so we cannot write out the data directly.
404  # Instead, we will pickle it out with version 2 of the pickle protocol.

--> 405 pickle.dump(array, fp, protocol=2)

So in answer to "Is Scipy smart enough to understand that it has loaded a sparse array?": no. np.load does not know about sparse matrices. But np.save is smart enough to punt when given something that isn't an array, and np.load does what it can with what it finds in the file.

As to alternative methods of saving and loading sparse arrays, the io.savemat, MATLAB compatible method, has been mentioned. It would be my first choice. But this example also shows that you can use the regular Python pickling. That might be better if you need to save a particular sparse format. And np.save isn't bad if you can live with the [()] extraction step. :)
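The regular-pickling route mentioned above is straightforward and preserves the exact sparse format; a minimal sketch (the file path is illustrative):

```python
import pickle
import numpy as np
import scipy.sparse as sparse

x = sparse.csr_matrix(np.arange(10).reshape(2, 5))

# Pickle the sparse matrix directly, with no object-array wrapping
with open('/tmp/my_array.pkl', 'wb') as f:
    pickle.dump(x, f, protocol=2)

with open('/tmp/my_array.pkl', 'rb') as f:
    y = pickle.load(f)

print(y.format, y.shape)  # the csr format survives the round trip
```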


In https://github.com/scipy/scipy/blob/master/scipy/io/matlab/mio5.py, write_sparse saves sparse matrices in csc format. Along with headers it saves A.indices.astype('i4'), A.indptr.astype('i4'), A.data.real, and optionally A.data.imag.
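The savemat/loadmat round trip looks like this (filename illustrative); note the matrix comes back in csc format, as write_sparse stores it:

```python
import numpy as np
import scipy.io
import scipy.sparse as sparse

# Save a csr matrix into a MATLAB-compatible .mat file
x = sparse.csr_matrix(np.arange(10, dtype=float).reshape(2, 5))
scipy.io.savemat('/tmp/my_array.mat', {'x': x})

# loadmat returns the sparse matrix in csc format
y = scipy.io.loadmat('/tmp/my_array.mat')['x']
print(y.format)  # 'csc'
```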


In quick tests I find that np.save/load handles all the sparse formats except dok, where the load complains about a missing shape. Otherwise I'm not finding any special pickling code in the sparse files.
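That quick test might look like the following sketch, looping over the common formats (dok excluded, as noted; results can vary by SciPy version):

```python
import numpy as np
import scipy.sparse as sparse

dense = np.arange(10).reshape(2, 5)
for fmt in ('csr', 'csc', 'coo', 'lil', 'dia', 'bsr'):
    x = sparse.csr_matrix(dense).asformat(fmt)
    # Wrap in a 0-d object array, as np.save effectively does
    wrapped = np.empty((), dtype=object)
    wrapped[()] = x
    np.save('/tmp/fmt_test.npy', wrapped)
    y = np.load('/tmp/fmt_test.npy', allow_pickle=True)[()]
    assert (y.toarray() == dense).all()
    print(fmt, 'ok')
```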

hpaulj