I am implementing Latent Semantic Analysis (LSA) using Eclipse Mars, Java 8, and Spark (spark-assembly-1.6.1-hadoop2.4.0.jar). I pass the documents in as tokens, then compute the TF-IDF matrix, the SVD, and so on:
HashingTF hf = new HashingTF(hashingTFSize);
JavaRDD<Vector> ArticlesAsV = hf.transform(articles.map(x -> x.tokens));
IDF idf = new IDF();
JavaRDD<Vector> ArticlesTFIDF = idf.fit(ArticlesAsV).transform(ArticlesAsV);
RowMatrix matTFIDF = new RowMatrix(ArticlesTFIDF.rdd());
double rCond = 1.0E-9d;
int k = 50;
SingularValueDecomposition<RowMatrix, Matrix> svd = matTFIDF.computeSVD(k, true, rCond);
Everything works perfectly except for one thing: when I try to get the index of a term from the HashingTF,
int index = hf.indexOf(term);
I find that many terms have the same index. These are some of the ones I got:
0 : Term
1 : all
1 : next
2 : tt
3 : the
7 : document
9 : such
9 : matrix
11 : document
11 : about
11 : each
12 : function
12 : chance
14 : this
14 : provides
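To illustrate what I mean outside of Spark: as far as I understand, HashingTF uses the hashing trick, mapping each term to `term.hashCode()` modulo the number of features, so distinct terms can land on the same index. This is a minimal pure-Java sketch of that behavior (my own reimplementation for demonstration, not the Spark API; the class and method names are mine):

```java
import java.util.*;

public class HashingCollisionDemo {
    // Sketch of the hashing trick: a non-negative hash modulo numFeatures.
    // Distinct terms can map to the same bucket.
    static int indexOf(Object term, int numFeatures) {
        int raw = term.hashCode() % numFeatures;
        return raw < 0 ? raw + numFeatures : raw;
    }

    public static void main(String[] args) {
        int numFeatures = 16; // deliberately tiny to force collisions
        String[] terms = {"document", "about", "each", "matrix", "such",
                          "function", "chance", "this", "provides", "the",
                          "all", "next", "tt", "term", "vector", "word",
                          "index", "hash", "spark", "java"};
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String t : terms) {
            buckets.computeIfAbsent(indexOf(t, numFeatures),
                                    k -> new ArrayList<>()).add(t);
        }
        // 20 distinct terms into 16 buckets: by pigeonhole, at least one
        // bucket must hold two or more terms.
        buckets.forEach((i, ts) -> {
            if (ts.size() > 1) System.out.println(i + " : " + ts);
        });
    }
}
```

With a larger `numFeatures` (e.g. 2^20) collisions become rarer, but they can never be ruled out entirely.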
This means that when I retrieve the vector of a term to do something with it, I may actually get the vector of a different term that hashed to the same index. I ran the pipeline after lemmatization and stop-word removal, but I still get the same collisions. Is there anything I missed, or a component (e.g. MLlib) that needs to be updated? How can I keep a unique index for each term?
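For reference, the alternative I am considering is building an explicit term-to-index dictionary instead of hashing, so that every distinct term gets its own index by construction. This is only a plain-Java sketch of the idea (class and helper names are mine, not a Spark API; I believe the spark.ml `CountVectorizer` does something similar):

```java
import java.util.*;

public class VocabularyIndex {
    // Assign each distinct term its own consecutive index.
    // Unlike hashing, two different terms can never share an index.
    static Map<String, Integer> buildVocabulary(List<List<String>> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> doc : docs)
            for (String term : doc)
                vocab.putIfAbsent(term, vocab.size());
        return vocab;
    }

    // Term-frequency vector over the fixed vocabulary.
    static double[] tfVector(List<String> doc, Map<String, Integer> vocab) {
        double[] v = new double[vocab.size()];
        for (String term : doc) {
            Integer i = vocab.get(term);
            if (i != null) v[i]++;
        }
        return v;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "document", "matrix"),
            Arrays.asList("each", "document", "function"));
        Map<String, Integer> vocab = buildVocabulary(docs);
        System.out.println(vocab); // every term has a distinct index
        // prints [1.0, 1.0, 1.0, 0.0, 0.0]
        System.out.println(Arrays.toString(tfVector(docs.get(0), vocab)));
    }
}
```

The trade-off is that the dictionary must be built (and kept in memory) up front, whereas HashingTF is stateless.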