Why does as.matrix result in memory overload while running text mining in R?

Question

I am doing a text analysis with R package tm.

My code is based on this link: https://www.r-bloggers.com/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know/

The text files I load are only 4800 kB. They are a 10% sample of the original files I want to analyze.

My code is:

library(tm)
library(wordcloud)
library(SnowballC)
library(textmineR)
library(RWeka)

blogssub <- readLines("10kblogs.txt")
newssub <- readLines("10knews.txt")
tweetssub <- readLines("10ktwitter.txt")

corpussubset <- c(blogssub,newssub,tweetssub)
cpsub <- corpussubset

cpsubclean <- VCorpus(VectorSource(cpsub))

# make ngrams
unigram<- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))

options(mc.cores=1) # hangs if you don't include this option on Mac OS

tdmuni<- TermDocumentMatrix(cpsubclean, control=list(tokenize=unigram))

m <- as.matrix(tdmuni)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

The code gives the following error: "Cannot allocate vector of size 12.3 Gb"

The error is caused by line: m <- as.matrix(tdmuni)

Could my code be inefficient in some way? I am surprised that a vector of 12.3 Gb is allocated, since the original text files are only 4800 kB.

Thanks a lot!

Most text mining matrices are very sparse, and are thus represented with a sparse matrix type (from the `Matrix` package; see `?Matrix::Matrix`). If you coerce one of these matrices to a dense (non-sparse) form with the 0s filled in, it can get huge. — alistaire, Jun 16 '18 at 19:20
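To illustrate the comment above: a term-document matrix from `tm` is stored in the sparse `simple_triplet_matrix` format from the `slam` package, which keeps only the non-zero entries. The sketch below uses made-up toy dimensions (1000 terms by 1000 documents, one non-zero count per term; these numbers are assumptions, not the asker's data) to show how much larger the dense form becomes after `as.matrix`.

```r
library(slam)  # sparse matrix class that tm builds on

# Hypothetical toy dimensions, chosen only for illustration.
n <- 1000L

# Sparse matrix with exactly one non-zero count per term (row).
sparse <- simple_triplet_matrix(
  i = seq_len(n), j = seq_len(n), v = rep(1, n),
  nrow = n, ncol = n
)

# Coercing to a dense matrix materialises every zero as well.
dense <- as.matrix(sparse)

sparse_bytes <- as.numeric(object.size(sparse))
dense_bytes  <- as.numeric(object.size(dense))

# The dense form stores n * n doubles at 8 bytes each.
print(c(sparse = sparse_bytes, dense = dense_bytes))
```

The dense size grows with terms times documents, regardless of how few entries are non-zero, which is why a small text file can still produce a multi-gigabyte allocation.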
@alistaire thanks. Do you mean I have to load package Matrix and replace as.matrix with a command from package Matrix? — user2165379, Jun 16 '18 at 19:27
You could, yes. In this case, I'd probably use [`tidytext::tidy`](https://www.tidytextmining.com/dtm.html) instead, though. — alistaire, Jun 16 '18 at 19:29
@alistaire thank you. I will try to implement this and let you know the outcome. — user2165379, Jun 16 '18 at 19:43
It seems you are coercing to a matrix because you want to use rowSums. If so, the `Matrix` package has a `rowSums` function specific to sparse matrices. — user2957945, Jun 16 '18 at 20:33
@user2957945 thanks. I want to make the barplot of frequencies, as shown at the end of the r-bloggers link. — user2165379, Jun 16 '18 at 20:40
okay, sorry, `Matrix` is the wrong sparse matrix package to use - use `slam`. So using the example you link to , try `library(slam) ; dtm — user2957945, Jun 16 '18 at 21:39
@user2957945 thanks, what is the slam-command in the code you provided? — user2165379, Jun 16 '18 at 21:48
`row_sums` instead of `rowSums`: this allows you to calculate the sums directly on the `dtm` matrix, without converting to a dense matrix (i.e. no need to use `as.matrix`), so it should *hopefully* avoid the memory problems. (so try `library(slam) ; v — user2957945, Jun 16 '18 at 21:50
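A minimal sketch of the `slam::row_sums` approach described above, using a tiny hand-built matrix in place of the asker's `tdmuni` (the terms and counts are invented for illustration). With a real `TermDocumentMatrix` from `tm`, the same `row_sums` call should apply directly, since that class is built on `simple_triplet_matrix`.

```r
library(slam)

# Toy stand-in for a term-document matrix: 3 terms x 3 documents.
# Terms and counts are hypothetical.
tdm <- simple_triplet_matrix(
  i = c(1L, 2L, 2L, 3L),   # term (row) index of each non-zero entry
  j = c(1L, 1L, 2L, 3L),   # document (column) index
  v = c(2, 1, 4, 1),       # term counts
  nrow = 3L, ncol = 3L,
  dimnames = list(Terms = c("apple", "banana", "cherry"),
                  Docs  = c("1", "2", "3"))
)

# Sum counts per term on the sparse object; no as.matrix() needed.
v <- sort(row_sums(tdm), decreasing = TRUE)

# Same frequency table as in the question's last step.
d <- data.frame(word = names(v), freq = v)
print(d)
```

The resulting `d` has the same shape as in the original code, so the word cloud or bar plot steps that follow it would not need to change.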
@user2957945 thanks, great. I will implement it and let you know the result! — user2165379, Jun 16 '18 at 21:55
@alistaire I have read the documentation for tidy, although it was pretty complicated for me. I have solved it with the slam package. Thanks. — user2165379, Jun 17 '18 at 12:40
`tidy` is doing the row sums for you, taking you straight to the data frame. Look at the data frame it returns. — alistaire, Jun 18 '18 at 16:23
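A hedged sketch of the `tidytext::tidy` route mentioned above, on a made-up three-document corpus. It assumes `tidy()` on a `TermDocumentMatrix` returns one row per non-zero term/document pair with columns `term`, `document` and `count`; in this sketch the per-term totals are then computed explicitly with `dplyr`.

```r
library(tm)
library(tidytext)
library(dplyr)

# Hypothetical toy corpus; words have 3+ letters so tm's default
# word-length filter keeps them.
docs <- c("apple banana apple", "banana cherry", "apple")
tdm  <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# Long format: one row per non-zero (term, document) pair.
long <- tidytext::tidy(tdm)

# Total count per term, most frequent first.
d <- long %>%
  group_by(term) %>%
  summarise(freq = sum(count)) %>%
  arrange(desc(freq))

print(d)
```

Because only non-zero entries are listed, the long data frame stays roughly proportional to the size of the text rather than to terms times documents.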
@alistaire I tried sort too, with the same problem. Will using dplyr::arrange solve this problem, as you mentioned earlier? Thanks — user2165379, Jun 23 '18 at 14:25
To use `order`, pull a vector out of the data frame, and then use the result to reorder the rows, e.g. `mtcars[order(mtcars$mpg), ]`. The equivalent dplyr would look like `mtcars %>% arrange(mpg)`. — alistaire, Jun 23 '18 at 14:28
@alistaire thanks. If I pull out a vector, don't I lose the connection with the other columns (since I only sort the vector and not the complete table)? — user2165379, Jun 23 '18 at 14:31
Read `?order`. It doesn't sort anything: It returns a vector of indices with which to rearrange something else so as to sort it. So no. — alistaire, Jun 23 '18 at 14:40
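A small base R sketch of the point about `order`, using an invented word/frequency table rather than the asker's data: `order` returns row indices, and indexing the whole data frame with them keeps every column aligned.

```r
# Hypothetical frequency table for illustration.
d <- data.frame(
  word = c("apple", "banana", "cherry"),
  freq = c(2, 5, 1),
  stringsAsFactors = FALSE
)

# order() does not sort; it returns the row positions in sorted order.
idx <- order(d$freq, decreasing = TRUE)

# Reordering whole rows keeps each word paired with its frequency.
d_sorted <- d[idx, ]
print(d_sorted)
```

The `dplyr` equivalent mentioned in the thread would be `d %>% arrange(desc(freq))`.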
"How to sort a data frame" is a question that is answered in lots and lots of places on the internet already, e.g. `https://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-multiple-columns` — alistaire, Jun 23 '18 at 15:42
