I am doing a text analysis with the R package tm.
My code is based on this link: https://www.r-bloggers.com/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know/
The text files I load are only 4800 kB in total; they are a 10% sample of the original files I want to analyze.
My code is:
library(tm)
library(wordcloud)
library(SnowballC)
library(textmineR)
library(RWeka)
# read the 10% sample files and combine them into one character vector
blogssub <- readLines("10kblogs.txt")
newssub <- readLines("10knews.txt")
tweetssub <- readLines("10ktwitter.txt")
corpussubset <- c(blogssub, newssub, tweetssub)
cpsub <- corpussubset
# build a volatile corpus, one document per line of input
cpsubclean <- VCorpus(VectorSource(cpsub))
# tokenizer for unigrams (n-grams with n = 1)
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
options(mc.cores = 1) # hangs if you don't include this option on macOS
tdmuni <- TermDocumentMatrix(cpsubclean, control = list(tokenize = unigram))
# convert the term-document matrix to a dense matrix and compute word frequencies
m <- as.matrix(tdmuni)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
The code gives the following error: "Cannot allocate vector of size 12.3 Gb"
The error is caused by this line: m <- as.matrix(tdmuni)
Can it be the case that my code is inefficient in some way? I am surprised that such a huge vector of 12.3 Gb is allocated, since the original text files are only 4800 kB.
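Would it help to skip the dense conversion entirely? I was thinking of computing the word frequencies directly on the sparse term-document matrix, for example with the slam package that tm uses for its sparse matrices. This is just an untested sketch of what I have in mind, so I may well be misusing row_sums:

library(slam)
# row_sums() works on tm's sparse simple_triplet_matrix, so no dense copy should be needed
v <- sort(row_sums(tdmuni), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)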
Thanks a lot!