Hi, I am trying to generate an adjacency matrix with a dimension of about 24,000 from a CSV with two columns listing pairs of genes and a third column of 1's indicating that the interaction is present. My goal is a square matrix populated with zeros for every combination not listed in the two columns.

I am using the following Python script:

import numpy as np
from scipy.sparse import coo_matrix

l, c, v = np.loadtxt("biogrid2.csv", dtype=int, skiprows=0, delimiter=",").T[:3, :]
m = coo_matrix((l, (v - 1, c - 1)), shape=(v.max(), c.max()))

m.toarray()

and it runs OK until it hits the following error:

File "/home/charlie/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py", line 1184, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)

MemoryError

Any ideas about how to get around the memory limit in SciPy?

Thanks

Daniel F
Michael Sughrue
  • Will this help? https://stackoverflow.com/a/8980156/4909087 – cs95 Oct 01 '18 at 07:50
  • It created `m` ok. The toarray step hits the memory limit. – hpaulj Oct 01 '18 at 08:38
  • What do you want to do with the array created by `m.toarray`? There are very few good reasons to turn a sparse array into a dense one, especially since it usually causes memory errors. – Daniel F Oct 01 '18 at 08:52
  • I basically want to use it for enrichment analyses of cancer gene expression, so I want to use it as a giant adjacency matrix. I'm not wedded to dense arrays; I just want a matrix I can run analyses on, and eventually to access the final matrix for survival analyses. – Michael Sughrue Oct 01 '18 at 08:59
  • The important part of the question is what *specific* types of analysis you want to do. Eigen-analysis? Linear algebra? Basic boolean math? Most of us have no idea how to do genetics, but we can talk math all day. – Daniel F Oct 01 '18 at 09:05
  • In short, I have about 160 vectors of about 23,000 expression levels (compared to baseline) in people with brain cancer. I want to put these levels into the nodes of the adjacency matrix and change the edge strengths using a random-walk type paradigm, then take the edge weights and analyze them as covariates. – Michael Sughrue Oct 01 '18 at 09:12
  • So basically you need to do matrix dot products for the random walk and covariance matrix, but don't really need an eigen-analysis of the data. Correct? Remember, our common language here is math. I don't know anything about cancer except it sucks. – Daniel F Oct 01 '18 at 09:18
  • Yes, this is correct. I just want to put some weights on the nodes of this matrix based on weights from the vectors I have, then perform a random walk and export the edge weights to a CSV file. – Michael Sughrue Oct 01 '18 at 09:23

2 Answers

Most likely what you want isn't m.toarray() but m.tocsr(). A CSR matrix can do simple linear algebra (like .dot() and matrix powers) natively. For instance, this works, provided m is square:

m = m.tocsr()  # tocsr() returns a new matrix, so keep the result
n = 3  # number of steps (example value)
random_walk_2 = m.dot(m)
random_walk_n = m ** n
# see https://stackoverflow.com/questions/28702416/matrix-power-for-sparse-matrix-in-python

Covariance should be implementable as well, but I'm not sure what the specific implementation would be without seeing what your current process is.
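
As a rough sketch of what the random-walk step could look like while staying sparse: the question does not spell out the exact scheme, so this assumes a plain row-normalized walk, and the 4-node graph and the `weights` vector are made-up stand-ins for the real data.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags

# Toy undirected graph with 4 nodes (hypothetical data, not the BioGRID file).
rows = np.array([0, 1, 1, 2, 2, 3])
cols = np.array([1, 0, 2, 1, 3, 2])
vals = np.ones(len(rows))
adj = coo_matrix((vals, (rows, cols)), shape=(4, 4)).tocsr()

# Row-normalize so every non-empty row sums to 1 (a transition matrix).
degree = np.asarray(adj.sum(axis=1)).ravel()
inv_degree = np.divide(1.0, degree, out=np.zeros_like(degree), where=degree > 0)
transition = diags(inv_degree).dot(adj)  # result is still sparse

# Hypothetical node weights, standing in for one expression vector.
weights = np.array([1.0, 0.0, 0.0, 0.0])

# Three steps of the walk: only sparse-matrix-times-vector products,
# so no dense n-by-n array is ever allocated.
state = weights
for _ in range(3):
    state = transition.T.dot(state)
```

Only `state` (one value per node) is dense here, so the same pattern should scale to 24,000 nodes without the `MemoryError`.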

EDIT: To turn the output back into a simpler format for writing to CSV, convert back to COO with .tocoo():

m = m.tocoo()  # tocoo() returns a new matrix, so keep the result
out = np.c_[m.data, m.row, m.col].T  # 3 rows: data, row, col
np.savetxt("foo.csv", out, delimiter=",")
# see https://stackoverflow.com/questions/6081008/dump-a-numpy-array-into-a-csv-file
Daniel F

toarray() converts your 24000*24000 sparse matrix (coo_matrix) into a dense 24000*24000 array, which needs at least

24000*24000*4 bytes = about 2.3 GB (2.15 GiB)

of memory if the values are 4-byte integers, and twice that with the 8-byte integers that dtype=int gives on 64-bit Linux.

To avoid using that much memory, skip the conversion to a dense matrix (toarray()) and do your operations on the sparse matrix.
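
To make the dense-versus-sparse gap concrete, here is a small sketch of the arithmetic. The three edges are invented for illustration; the real file would have many more, but the sparse cost still grows with the number of edges rather than with n*n.

```python
import numpy as np
from scipy.sparse import coo_matrix

n = 24_000

# Bytes a dense n-by-n array would need for two common integer widths.
dense_bytes_int32 = n * n * np.dtype(np.int32).itemsize  # 2_304_000_000
dense_bytes_int64 = n * n * np.dtype(np.int64).itemsize  # 4_608_000_000

# A sparse matrix of the same shape holding three hypothetical edges.
rows = np.array([0, 5, 23_999])
cols = np.array([1, 7, 0])
vals = np.ones(3, dtype=np.int32)
m = coo_matrix((vals, (rows, cols)), shape=(n, n))

# COO storage is just three 1-D arrays: values, row indices, column indices.
sparse_bytes = m.data.nbytes + m.row.nbytes + m.col.nbytes
```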

If you need the matrix squared, m*m (or m.dot(m)) gives the matrix product and m.multiply(m) gives the element-wise square; both return a sparse matrix.
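
The two kinds of "squared" are easy to mix up: `dot` is the matrix product, while `multiply` works element by element. A tiny sketch on a made-up 2x2 matrix shows that they give different results:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small made-up matrix, just to compare the two operations.
m = csr_matrix(np.array([[0, 2],
                         [3, 0]]))

matrix_square = m.dot(m)            # matrix product: [[6, 0], [0, 6]]
elementwise_square = m.multiply(m)  # entry-wise:     [[0, 4], [9, 0]]
```

For counting two-step paths in an adjacency matrix it is the matrix product that applies.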

To save your matrix you have several options.

The simplest one is NPZ; see https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.save_npz.html or the question "Save / load scipy sparse csr_matrix in portable data format".
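
A minimal round trip with `save_npz` and `load_npz` might look like this. The 3x3 matrix and the file name `adjacency.npz` are placeholders, and a temporary directory is used so the sketch does not touch your working folder.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import coo_matrix, load_npz, save_npz

# Small placeholder matrix standing in for the real adjacency matrix.
m = coo_matrix((np.ones(3), ([0, 1, 2], [1, 2, 0])), shape=(3, 3))

# Write to a temporary location (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "adjacency.npz")
save_npz(path, m)

# Read it back; the result is again a sparse matrix.
restored = load_npz(path)
```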

If you want your result in the same layout as your initial CSV file, coo_matrix has these attributes:

  • data: COO format data array of the matrix
  • row: COO format row index array of the matrix
  • col: COO format column index array of the matrix

(see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html), which can be used to create the CSV file.
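
One way those three attributes could be written out, one edge per line. The column order (row, column, value) and the 1-based indices are assumptions chosen to mirror the `- 1` in the question's loader; adjust them to match your actual file. An in-memory buffer stands in for a real file here.

```python
import io

import numpy as np
from scipy.sparse import coo_matrix

# Small placeholder matrix standing in for the real result.
m = coo_matrix(
    (np.array([1, 1, 1]), (np.array([0, 1, 2]), np.array([1, 2, 0]))),
    shape=(3, 3),
)

# One line per stored entry: row, column, value (1-based indices, assumed).
triplets = np.column_stack((m.row + 1, m.col + 1, m.data))

# Pass a file name instead of the buffer to write to disk.
buffer = io.StringIO()
np.savetxt(buffer, triplets, fmt="%d", delimiter=",")
csv_text = buffer.getvalue()
```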

PilouPili