I am doing a text analysis with the R package tm.

My code is based on this link: https://www.r-bloggers.com/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know/

The text files I load are only 4800 kB. They are a 10% sample of the original files I want to analyze.

My code is:

library(tm)
library(wordcloud)
library(SnowballC)
library(textmineR)
library(RWeka)

blogssub <- readLines("10kblogs.txt")
newssub <- readLines("10knews.txt")
tweetssub <- readLines("10ktwitter.txt")

corpussubset <- c(blogssub,newssub,tweetssub)
cpsub <- corpussubset

cpsubclean <- VCorpus(VectorSource(cpsub))

# make ngrams
unigram<- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))

options(mc.cores=1) # hangs on Mac OS if you don't include this option

tdmuni<- TermDocumentMatrix(cpsubclean, control=list(tokenize=unigram))

m <- as.matrix(tdmuni)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)    

The code gives the following error: "Cannot allocate vector of size 12.3 Gb"

The error is caused by line: m <- as.matrix(tdmuni)

Could my code be inefficient in some way? I am surprised that such a huge vector (12.3 Gb) is allocated, since the original text files are only 4800 kB.

Thanks a lot!

user2165379
  • Most text mining matrices are very sparse, and are thus represented with a sparse matrix type (from the `Matrix` package; see `?Matrix::Matrix`). If you coerce one of these matrices to a dense (non-sparse) form with the 0s filled in, it can get huge. – alistaire Jun 16 '18 at 19:20
  • @alistaire thanks. Do you mean i have to load package Matrix and replace as.matrix by a command from package Matrix? – user2165379 Jun 16 '18 at 19:27
  • You could, yes. In this case, I'd probably use [`tidytext::tidy`](https://www.tidytextmining.com/dtm.html) instead, though. – alistaire Jun 16 '18 at 19:29
  • @alistaire thank you. I will try to implement this and let you know the outcome. – user2165379 Jun 16 '18 at 19:43
  • it seems you are coercing to a matrix as you want to use rowSums - if so, the `Matrix` package has a `rowSums` function specific for sparse matrices. – user2957945 Jun 16 '18 at 20:33
  • @user2957945 thanks. I want to make the barplot of frequencies, as shown at the end of the r-bloggers link. – user2165379 Jun 16 '18 at 20:40
  • okay, sorry, `Matrix` is the wrong sparse matrix package to use - use `slam`. So using the example you link to , try `library(slam) ; dtm – user2957945 Jun 16 '18 at 21:39
  • @user2957945 thanks, what is the slam-command in the code you provided? – user2165379 Jun 16 '18 at 21:48
  • `row_sums` instead of `rowSums`: this allows you to calculate the sums directly on the `dtm` matrix, without converting to a dense matrix (i.e. no need to use `as.matrix`), so should *hopefully* avoid the memory problems. (so try `library(slam) ; v – user2957945 Jun 16 '18 at 21:50
  • @user2957945 thanks, great. I will implement it and let you know the result! – user2165379 Jun 16 '18 at 21:55
  • @user2957945, this works fine with slam. Thank you! – user2165379 Jun 17 '18 at 12:39
  • @alistaire i have read the documentation from tidy, although it was pretty complicated to me. I have solved it with the slam package. Thanks. – user2165379 Jun 17 '18 at 12:40
  • Just pass in your DTM, and it'll give you a data frame: `d – alistaire Jun 17 '18 at 15:12
  • @alistaire thanks, i will try that too and let you know. – user2165379 Jun 17 '18 at 17:28
  • @alistaire , i tried m – user2165379 Jun 18 '18 at 10:28
  • `tidy` is doing the row sums for you, taking you straight to the data frame. Look at the data frame it returns. – alistaire Jun 18 '18 at 16:23
  • @alistaire i have changed the code to: m – user2165379 Jun 23 '18 at 09:53
  • `order` takes a vector (or vectors) – alistaire Jun 23 '18 at 14:21
  • @alistaire i tried sort too with the same problem. Will using dplyr::arrange solve this problem as you mentioned earlier? thanks – user2165379 Jun 23 '18 at 14:25
  • To use `order`, pull a vector out of the data frame, and then use the result to reorder the rows, e.g. `mtcars[order(mtcars$mpg), ]`. The equivalent dplyr would look like `mtcars %>% arrange(mpg)`. – alistaire Jun 23 '18 at 14:28
  • @alistaire thanks. If i pull out a vector, don't i miss the connection with the other columns in that case (since i only sort the vector and not the complete table)? – user2165379 Jun 23 '18 at 14:31
  • Read `?order`. It doesn't sort anything: It returns a vector of indices with which to rearrange something else so as to sort it. So no. – alistaire Jun 23 '18 at 14:40
  • @alistaire i have this line now which gives no error: v – user2165379 Jun 23 '18 at 15:31
  • @alistaire i have this line now which gives no error: v – user2165379 Jun 23 '18 at 15:37
  • "How to sort a data frame" is a question that is answered in lots and lots of places on the internet already, e.g. `https://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-multiple-columns` – alistaire Jun 23 '18 at 15:42
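
The approach that worked in the comments (summing the sparse term-document matrix with `slam::row_sums` instead of calling `as.matrix`, then sorting the data frame with `order`) can be sketched roughly as follows. This is only a minimal sketch under some assumptions: the three toy sentences stand in for the question's text files, tm's default tokenizer is used for unigrams instead of the RWeka tokenizer, and the `wordLengths` setting and the `dense_bytes` estimate (8 bytes per cell, assuming the counts are stored as doubles) are illustrative additions that do not appear in the original code.

```r
library(tm)    # corpus and term-document matrix
library(slam)  # row_sums() for sparse simple_triplet_matrix objects

# Toy stand-in for c(blogssub, newssub, tweetssub); these texts are made up.
texts <- c("the cat sat on the mat",
           "the dog chased the cat",
           "a dog and a cat")

corpus <- VCorpus(VectorSource(texts))

# Assumption: tm's default tokenizer is enough for unigrams.
# wordLengths = c(1, Inf) keeps short words such as "a" and "on";
# by default tm drops terms shorter than 3 characters.
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Rough size of a dense copy: terms x documents cells,
# assuming 8 bytes (one double) per cell.
dense_bytes <- prod(dim(tdm)) * 8

# Sum each term's counts on the sparse matrix; no as.matrix() needed.
freq <- slam::row_sums(tdm)

# Build the word/frequency table. order() returns row indices,
# so indexing with it moves whole rows and keeps word and freq together.
d <- data.frame(word = names(freq), freq = as.numeric(freq),
                stringsAsFactors = FALSE)
d <- d[order(d$freq, decreasing = TRUE), ]
rownames(d) <- NULL

head(d)
```

With a real corpus, the number of cells (distinct terms times documents) grows far faster than the size of the text files, which is why the dense copy can need gigabytes while the sparse matrix stays small.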

0 Answers