I am facing some issues with my k-means clustering results on Alteryx. I am trying to conduct topic modelling on my data set of around 5000 text descriptions. After data cleaning, parsing and removing stop words and common words, I created a Document Term Matrix of 20 words and around 5000 documents.

After running K-Means Clustering in Alteryx, no matter how many clusters I specify, every cluster except one contains only 1 document, and the remaining cluster contains all the rest. For example:

2 Clusters

  • Cluster 1: 19 words
  • Cluster 2: 1 word

3 Clusters

  • Cluster 1: 18 words
  • Cluster 2: 1 word
  • Cluster 3: 1 word

5 Clusters

  • Cluster 1: 16 words
  • Cluster 2: 1 word
  • Cluster 3: 1 word
  • Cluster 4: 1 word
  • Cluster 5: 1 word

The same pattern appears for every number of clusters I have tried. Could someone help me work out whether these results mean my data has a problem, or whether I used the wrong settings?

Thanks in advance!

Adrian
  • Why only 20 words? Are you using tf-idf? Are you clustering words or documents? Some more details on your method might help. However, it might really just be the data... – user3658307 Oct 04 '18 at 20:25
  • hi @user3658307 i calculated the frequency of occurrence of every word in the data set and took the top 20 most commonly used words (after removing stop words and other industry lingo). I am not sure if that was a form of tf-idf? – Adrian Oct 05 '18 at 01:35
  • Are you implementing some particular algorithm from somewhere? Can you post more info, e.g. what are the 20 words, what are in the documents (e.g. books, newspapers; what categories are there, etc...)? That might help diagnose the problem. Also, I suggest understanding [tfidf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) if you are not familiar with it because that doesn't sound like what you are doing. It might help too. – user3658307 Oct 05 '18 at 02:31
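
To illustrate the distinction raised in the comments, here is a minimal sketch of the basic tf-idf weighting, assuming a document-term matrix of raw counts with documents as rows. The function name `tf_idf` and the toy counts are hypothetical, and this is the plain `tf * log(N / df)` variant; library implementations typically apply smoothing and normalization on top of it.

```python
import numpy as np

def tf_idf(counts):
    """Basic tf-idf: (term count / document length) * log(N / document frequency)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Number of documents in which each term appears at least once.
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log(n_docs / np.maximum(doc_freq, 1))
    # Term frequency relative to document length; empty documents stay all-zero.
    doc_len = counts.sum(axis=1, keepdims=True)
    tf = np.divide(counts, doc_len, out=np.zeros_like(counts), where=doc_len > 0)
    return tf * idf

# Toy counts (made up): 3 documents x 2 terms.
# Term 0 occurs in every document, term 1 in only one.
toy_counts = np.array([
    [3, 1],
    [2, 0],
    [4, 0],
])
weights = tf_idf(toy_counts)
print(weights)
```

In this toy case the term that occurs in every document gets a weight of zero, while the rarer term keeps a positive weight. Picking the top 20 words by overall frequency does the opposite: it favors exactly the words that appear everywhere.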

1 Answer

Did you look at your data after preprocessing?

Probably many documents are now empty, or contain just one word.

There is not much left except finding the common words.
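
One way to run that check, sketched under the assumption that the document-term matrix can be exported from Alteryx and loaded as a NumPy array with documents as rows (the helper `row_summary` and the toy matrix are hypothetical):

```python
import numpy as np

def row_summary(dtm):
    """Count documents (rows) that are empty or contain a single distinct term."""
    dtm = np.asarray(dtm)
    # Number of distinct vocabulary terms present in each document.
    distinct_terms = (dtm > 0).sum(axis=1)
    return {
        "documents": int(dtm.shape[0]),
        "empty": int((distinct_terms == 0).sum()),
        "single_term": int((distinct_terms == 1).sum()),
    }

# Toy matrix (made-up counts): 5 documents x 4 vocabulary terms.
toy_dtm = np.array([
    [0, 0, 0, 0],
    [2, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 3],
    [0, 0, 0, 0],
])
summary = row_summary(toy_dtm)
print(summary)
```

If a large share of the roughly 5000 rows turns out to be empty or single-term, most documents are identical or nearly identical as far as k-means is concerned, so there is very little structure left for it to separate.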

Has QUIT--Anony-Mousse