
I have a document-term matrix of size 4280 x 90140 (there are 4280 documents and 90140 unique words) called sparse_dtm, represented as a scipy dok_matrix.

For each word in each document, I want to calculate a probability: the count of that word in the document, divided by the total count of that word across all documents. To do so, I have the following code:

for j in range(num_documents):
    row_counts = sparse_dtm.getrow(j).toarray()[0]    # dense count vector for document j
    word_index = row_counts.nonzero()[0]              # column indices of the words present in document j
    non_zero_row_counts = row_counts[row_counts != 0]

    for i, count in enumerate(non_zero_row_counts):
        word = index_to_word[word_index[i]]
        prob_ji = count / sum_words[word]             # count in this document / total count in corpus
        p_ji[j, word_index[i]] = prob_ji

With:

index_to_word = dictionary with: key: index of a word, value: word

word_to_index = dictionary with: key: word, value: index of a word

vocabulary = set()
for text in data_list:
    vocabulary.update(text)

word_to_index = dict()
index_to_word = dict()

for i, word in enumerate(vocabulary):
    word_to_index[word] = i
    index_to_word[i] = word

sum_words = dictionary with: key: word, value: total count of that word in the corpus

from collections import Counter

sum_words = Counter()
for doc in data_list:
    sum_words.update(doc)   # Counter.update counts the elements of an iterable directly

import numpy as np
from scipy.sparse import dok_matrix

p_ji = dok_matrix((num_documents, vocabulary_size), dtype=np.float32)
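To make this runnable end to end (as noted in my comment below, the snippets above are not listed in execution order), the remaining names can be defined as follows. The toy data_list is the one from my comment; the way sparse_dtm is built here is just one plausible construction, since the original build code is not shown:

data_list = [['This', 'is', 'text', 'one'],
             ['This', 'is', 'text', 'two'],
             ['This', 'is', 'text', 'three']]

num_documents = len(data_list)
vocabulary_size = len(vocabulary)

# Assumed construction of the document-term count matrix:
# one row per document, one column per word, entries are raw counts
sparse_dtm = dok_matrix((num_documents, vocabulary_size), dtype=np.int32)
for i, doc in enumerate(data_list):
    for word in doc:
        sparse_dtm[i, word_to_index[word]] += 1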

The nested loop above is a sequential process and very time-inefficient.

Is there a way to parallelize this process?


I found this page; however, it is unclear to me how to apply it to my code.
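Edit: one observation, in case it helps. Since the total count of a word is just that word's column sum in sparse_dtm (assuming column k corresponds to index_to_word[k]), the whole loop can be replaced by a single column-wise normalization. This is vectorization rather than parallelization, and only a sketch; it also assumes every word occurs at least once, so no column sum is zero:

import numpy as np
from scipy.sparse import diags

csr = sparse_dtm.tocsr()                          # CSR supports fast slicing and arithmetic
col_sums = np.asarray(csr.sum(axis=0)).ravel()    # total count of each word over all documents
p_ji = (csr @ diags(1.0 / col_sums)).todok()      # scale column k by 1 / total count of word k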

  • AFAIK, you cannot mutate a `dok_matrix` in parallel: that would result in either a race condition or inefficient code if you use a critical section. You could try to work on different parts of `p_ji` and then merge the results (a sketch of this idea is included after these comments), but the merge will probably be too slow unless you do not care about changing its type (to a dense matrix or a semi-dense one). I am not sure Python is the right tool here. That being said, can you provide a more reproducible code sample so we can help you more easily? – Jérôme Richard Mar 19 '21 at 19:22
  • Thank you for your comment @JérômeRichard. Why do you think this would result in a race condition? There seem to be no sequential dependencies among the computations. Sure, I can send some code. What code do you want me to provide? – Emil Mar 19 '21 at 21:36
  • `dok_matrix` is a sparse matrix, so it is stored in a compact way that is likely incompatible with access from multiple threads. I guess that SciPy does not support parallel accesses on this type (like most Python libraries), and the Python GIL will sadly prevent any parallel access anyway. Regarding the code, it would be great to have something we can run ourselves (a simple/minimal working prototype). – Jérôme Richard Mar 19 '21 at 21:59
  • I added more code, so you should be able to run this yourself. Right now, all you need is `data_list`, a list of lists containing a tokenized text in each sub-list, formatted as follows: `data_list = [['This', 'is','text','one'],['This', 'is','text','two'], ['This', 'is','text','three']]`. Note that the order in which I added the code here differs from the order in which you would run it, as the initial snippet uses variables that haven't been initialized yet in the code above. The current structure keeps the focus on this question's topic. – Emil Mar 20 '21 at 15:59
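Below is a minimal sketch of the split-then-merge idea from the first comment. It is an assumption-laden prototype, not a drop-in solution: it relies on a fork-based multiprocessing start method (so the workers inherit csr and col_sums), assumes every word occurs at least once, and builds a new matrix rather than mutating p_ji in place, which avoids the race condition discussed above:

import numpy as np
from multiprocessing import Pool
from scipy.sparse import vstack

csr = sparse_dtm.tocsr()
col_sums = np.asarray(csr.sum(axis=0)).ravel()

def normalize_rows(bounds):
    lo, hi = bounds
    # Each worker only reads its own row block and returns a new matrix,
    # so no shared structure is mutated.
    return csr[lo:hi].multiply(1.0 / col_sums)

if __name__ == '__main__':
    n_workers = 4
    step = -(-num_documents // n_workers)    # ceiling division
    blocks = [(lo, min(lo + step, num_documents))
              for lo in range(0, num_documents, step)]
    with Pool(n_workers) as pool:
        p_ji = vstack(pool.map(normalize_rows, blocks)).todok()

Whether this beats the vectorized version shown in the question is doubtful: the per-block work is already a single SciPy call, so the process overhead and the final merge may dominate, as the first comment predicts.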

0 Answers