
I am implementing Latent Semantic Analysis (LSA) using Eclipse Mars, Java 8, and Spark (spark-assembly-1.6.1-hadoop2.4.0.jar). I passed the documents in as tokens, then computed the SVD, and so on:

    HashingTF hf = new HashingTF(hashingTFSize);
    JavaRDD<Vector> ArticlesAsV = hf.transform(articles.map(x -> x.tokens));
    IDF idf = new IDF();
    JavaRDD<Vector> ArticlesTFIDF = idf.fit(ArticlesAsV).transform(ArticlesAsV);
    RowMatrix matTFIDF = new RowMatrix(ArticlesTFIDF.rdd());
    double rCond = 1.0E-9d;
    int k = 50;
    SingularValueDecomposition<RowMatrix, Matrix> svd = matTFIDF.computeSVD(k, true, rCond);

Everything works perfectly except for one thing: when I try to get the indices of the terms from the HashingTF

int index = hf.indexOf(term);

I found that many terms share the same index. These are some of the results I got:

0 : Term
1 : all
1 : next
2 : tt
3 : the
7 : document
9 : such
9 : matrix
11 : document
11 : about
11 : each
12 : function
12 : chance
14 : this
14 : provides
This means that when I try to get the vector of a term to do something with it, I may get the vector of another term that shares the same index. I did this after lemmatization and after removing stop words, but I still got the same problem. Is there anything I missed, or a component (e.g. MLlib) that needs to be updated? How can I keep a unique index for each term?

Yas

1 Answer


The Spark class HashingTF utilizes the hashing trick.

A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e., the number of buckets of the hash table. The default feature dimension is 2^20=1,048,576.

So groups of terms can have the same index.
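For example, with a small number of buckets a collision is easy to reproduce. The snippet below is only an illustrative sketch (it reuses the hashingTFSize variable from the question); indexOf essentially computes the term's hash code modulo numFeatures, with no term-to-index dictionary behind it:

    HashingTF hf = new HashingTF(hashingTFSize);
    // Two distinct terms can map to the same bucket, exactly as in the output above
    // where "document" and "about" both ended up at index 11.
    int i1 = hf.indexOf("document");
    int i2 = hf.indexOf("about");
    // i1 == i2 is possible: the index is hash(term) mod numFeatures, so nothing
    // guarantees uniqueness per term, and a smaller numFeatures raises the collision rate.
    System.out.println(i1 + " " + i2);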

Regarding the comments below: if you need a unique index for every term, you can use CountVectorizer instead of HashingTF. CountVectorizer can also be used to get term frequency vectors. To use CountVectorizer (and subsequently IDF) you must use DataFrame instead of JavaRDD, because CountVectorizer is supported only in the ml package.

This is an example of a DataFrame with columns id and words:

id | words
---|----------  
0  | Array("word1", "word2", "word3")  
1  | Array("word1", "word2", "word2", "word3", "word1")

So if you translate the articles JavaRDD into a DataFrame with columns id and words, where each row is a bag of words from a sentence or document, you can compute TF-IDF with code like this:

    CountVectorizerModel cvModel = new CountVectorizer()
        .setInputCol("words")
        .setOutputCol("rawFeatures")
        .setVocabSize(100000) // <-- Specify the max size of the vocabulary.
        .setMinDF(2)          // Minimum number of different documents a term must appear in to be included in the vocabulary.
        .fit(df);

    DataFrame featurizedData = cvModel.transform(df);

    IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
    IDFModel idfModel = idf.fit(featurizedData);
    DataFrame rescaledData = idfModel.transform(featurizedData);
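
Unlike HashingTF, the fitted CountVectorizerModel keeps an explicit vocabulary, so every term gets its own unique index. As a sketch (using the cvModel built above), you can recover the term-to-index mapping directly from the model:

    // The position of a term in the vocabulary array is its index in the
    // "rawFeatures"/"features" vectors, and it is unique per term.
    String[] vocabulary = cvModel.vocabulary();
    for (int i = 0; i < vocabulary.length; i++) {
        System.out.println(i + " : " + vocabulary[i]);
    }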
Umberto Griffo
  • I got it. I set the size to the max of 1048576, and for 14696 unique terms I got 14453 unique indices, missing 243 terms. But will it do the same for 671333 unique terms? Since I have a bigger data set to test on, will it miss many terms? – Yas Sep 07 '16 at 14:18
  • For 671333 I got 489302, missing 185031. Still not good; I need an index for every term, and the missing terms could be important ones. Any way to work around it? – Yas Sep 07 '16 at 14:33
  • You could only enlarge the max size beyond 1048576 using the constructor **HashingTF(numFeatures: Int)** and hope that it generates few collisions. Even if HashingTF misses many terms, there are [two reasons it still works](http://blog.someben.com/2013/01/hashing-lang/) – Umberto Griffo Sep 07 '16 at 14:36
  • But if I am going to find similarities between a new short doc with few words, I think it will affect the results. With new large docs with many words we can neglect it, but I am working on short ones. Do you think it will give acceptable results? (Again: the input docs are large, the test docs are short.) – Yas Sep 07 '16 at 14:42
  • A possible workaround is to use [CountVectorizer](https://spark.apache.org/docs/latest/ml-features.html#countvectorizer) instead of HashingTF. CountVectorizer can also be used to get term frequency vectors. – Umberto Griffo Sep 07 '16 at 14:57
  • Can I use CountVectorizer to compute tfidf? For the HashingTF I used this code: JavaRDD ArticlesAsV = hf.transform(articles.map(x->x.tokens)); IDF idf = new IDF(); JavaRDD ArticlesTFIDF = idf.fit(ArticlesAsV).transform(ArticlesAsV); – Yas Sep 07 '16 at 15:13
  • I updated the answer because the number of characters allowed in a comment is limited. – Umberto Griffo Sep 07 '16 at 15:59
  • @Umberto Thank you, but I wonder if it can support 400K documents and 670K terms? – Yas Sep 08 '16 at 19:25
  • @Yas It depends on your Spark cluster's size. Anyway, Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on data of any size. – Umberto Griffo Sep 09 '16 at 08:37
  • @Umberto Thanks, and yes, I think it will work, but slower on a small cluster. – Yas Sep 09 '16 at 21:53