I have a document-term matrix of size 4280 x 90140 (there are 4280 documents and 90140 unique words) called `sparse_dtm`, represented as a SciPy `dok_matrix`.
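For concreteness, here is a tiny stand-in for that setup (the shapes and counts below are made up purely for illustration; my real matrix is much larger):

```python
import numpy as np
from scipy.sparse import dok_matrix

# Toy stand-in: 3 documents, 5 unique words (the real matrix is 4280 x 90140)
num_documents, vocabulary_size = 3, 5
sparse_dtm = dok_matrix((num_documents, vocabulary_size), dtype=np.int64)
sparse_dtm[0, 1] = 2   # word 1 occurs twice in document 0
sparse_dtm[1, 1] = 1   # word 1 occurs once in document 1
sparse_dtm[2, 4] = 3   # word 4 occurs three times in document 2
```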
For each word *j* in document *i*, I want to calculate the probability of that word in that document: the count of word *j* in document *i*, divided by the total count of word *j* across all documents, i.e. `p_ji[i, j] = count(i, j) / total_count(j)`. To do so, I have the following code:
```python
for j in range(num_documents):
    # Dense 1-D array of word counts for document j
    row_counts = sparse_dtm.getrow(j).toarray()[0]
    # Vocabulary indices of the words that occur in this document
    word_index = row_counts.nonzero()[0]
    non_zero_row_counts = row_counts[row_counts != 0]
    for i, count in enumerate(non_zero_row_counts):
        word = index_to_word[word_index[i]]
        # Count in this document divided by the word's corpus-wide total
        prob_ji = count / sum_words[word]
        p_ji[j, word_index[i]] = prob_ji
```
With:

`index_to_word` = dictionary with key: index of a word, value: word
`word_to_index` = dictionary with key: word, value: index of a word

```python
vocabulary = set()
for text in data_list:
    vocabulary.update(text)

word_to_index = dict()
index_to_word = dict()
for i, word in enumerate(vocabulary):
    word_to_index[word] = i
    index_to_word[i] = word
```
`sum_words` = dictionary with key: word, value: total count of that word in the corpus
```python
from collections import Counter

sum_words = Counter()
for doc in data_list:
    sum_words.update(Counter(doc))
```
The output matrix is initialized as:

```python
import numpy as np
from scipy.sparse import dok_matrix

p_ji = dok_matrix((num_documents, vocabulary_size), dtype=np.float32)
```
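As a side note, I believe the per-word totals in `sum_words` could also be read straight off the matrix as column sums, which might already remove one Python-level loop (a sketch, assuming the column order of `sparse_dtm` matches `word_to_index`):

```python
import numpy as np

# Column j holds the counts of word j in every document,
# so summing over axis 0 gives the corpus-wide total per word.
total_counts = np.asarray(sparse_dtm.sum(axis=0)).ravel()
# total_counts[word_to_index[word]] should equal sum_words[word]
```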
This is a sequential process and very time-inefficient. Is there a way to parallelize it?

I found this page, but it is unclear to me how to implement that approach with my code.
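For reference, this is roughly the direction I have been sketching from my reading of that page: convert to CSR, precompute the column totals, and let each worker normalise a block of rows. I am not sure it is correct or that it will actually be faster (the chunking and `n_jobs` below are my own guesses, not taken from the page):

```python
from multiprocessing import Pool

import numpy as np
from scipy.sparse import csr_matrix, vstack

# CSR slices rows much faster than dok_matrix
csr_dtm = csr_matrix(sparse_dtm)
# Corpus-wide total of each word, as a 1-D array
totals = np.asarray(csr_dtm.sum(axis=0)).ravel()

def normalise_block(rows):
    # Divide every entry in these rows by its word's corpus total
    block = csr_dtm[rows].multiply(1.0 / totals)
    return csr_matrix(block)

if __name__ == "__main__":
    n_jobs = 4  # guess; assumes a fork-based start so workers see csr_dtm
    chunks = np.array_split(np.arange(csr_dtm.shape[0]), n_jobs)
    with Pool(n_jobs) as pool:
        p_ji = vstack(pool.map(normalise_block, chunks))
```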