I am a beginner at data science, and I am working on a text analytics/sentiment analysis project with tweets. What I have been trying to do is to perform some dimension reduction on my training set of tweets, feed the reduced training set into a NaiveBayes learner, and use the learned NaiveBayes to predict the sentiment of the tweets in the test set.
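
Concretely, the shape I am aiming for is roughly the following (just a sketch of the idea, not code I have working; naiveBayes() here is from the e1071 package, and train.features, test.features and labels.train are placeholder names, not my actual variables):

library(e1071)

# train.features / test.features: some reduced numeric representation of the tweets
# labels.train: the known sentiment labels of the training tweets
nb.model <- naiveBayes(x = train.features, y = as.factor(labels.train))
pred     <- predict(nb.model, test.features)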

I have been following the steps in this article:

http://www.analyticskhoj.com/data-mining/text-analytics-part-iv-cluster-analysis-on-terms-and-documents-using-r/

Their explanation is a bit too brief for a beginner like me.

I have used lsa() to create what RStudio labels as a "Large LSAspace (3 elements)". Following their example, I've created 3 more data frames:

lsa.train.tk = as.data.frame(lsa.train$tk)
lsa.train.dk = as.data.frame(lsa.train$dk)
lsa.train.sk = as.data.frame(lsa.train$sk)
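
For reference, this is roughly how I created that LSA space (a minimal sketch from memory; corpus.train and tdm.train are placeholders for my cleaned training tweets and their term-document matrix, built with the tm package):

library(tm)
library(lsa)

tdm.train <- as.matrix(TermDocumentMatrix(corpus.train))  # terms x tweets
lsa.train <- lsa(tdm.train, dims = dimcalc_share())       # the "Large LSAspace (3 elements)"

# as far as I understand the lsa package documentation:
#   lsa.train$tk  - term vectors     (one row per term,  one column per LSA factor)
#   lsa.train$dk  - document vectors (one row per tweet, one column per LSA factor)
#   lsa.train$sk  - the singular values of the reduced space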

When I view the lsa.train.tk data, it looks like the screenshot below (lsa.train.dk looks pretty similar):

[screenshot of the lsa.train.tk data frame]

and my lsa.train.sk looks like this:

[screenshot of the lsa.train.sk data frame]

My question is: how do I interpret this information? How can I use it to create something that I can feed into my NaiveBayes learner? I tried just using lsa.train.sk for the NaiveBayes learner, but I cannot think of any good justification for that. Any help would be much appreciated!

EDIT: What I've done so far:

  1. turn everything into a term-document matrix
  2. pass the matrix into the NaiveBayes learner
  3. predict using the learned model (roughly as sketched below)
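
In code, what I've done so far looks roughly like this (a sketch only; naiveBayes() is from the e1071 package, and corpus.train, corpus.test, labels.train, labels.test are placeholders for my cleaned tweet corpora and their sentiment labels):

library(tm)
library(e1071)

# 1. term-document matrices with tweets as rows
#    (the test matrix has to use the same vocabulary/columns as the training one)
dtm.train.raw <- DocumentTermMatrix(corpus.train)
dtm.test.raw  <- DocumentTermMatrix(corpus.test,
                                    control = list(dictionary = Terms(dtm.train.raw)))
dtm.train <- as.matrix(dtm.train.raw)
dtm.test  <- as.matrix(dtm.test.raw)

# 2. learn a NaiveBayes model from the raw term counts
nb.model <- naiveBayes(x = dtm.train, y = as.factor(labels.train))

# 3. predict sentiment for the test tweets and check accuracy
pred <- predict(nb.model, dtm.test)
mean(pred == labels.test)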

My problems are:

  1. Accuracy is only 50%, and I realized that it labels everything as positive sentiment (so I could have gotten close to 0% accuracy if my test set contained only negative-sentiment tweets).

  2. The current code is not scalable. Since it works with large dense matrices, I can only handle up to about 3.5k rows of data; with more than that, my computer crashes. That is why I wanted to do dimension reduction, so that I can handle more data (such as 10k or 100k rows of tweets).
