
As a beginner in NLP, I am trying to find the best way to cluster single words with unsupervised clustering, specifically when the number of clusters k is not known in advance. I have a group of words that contains clusters of words that are very similar to each other (off by one or two letters). By similar I mean string-level similarity such as cosine similarity, not semantic similarity. I would like to find the number of these clusters in the group without defining k in advance.

To take a basic example, I have tried hierarchical clustering on Levenshtein distances, where cutting the tree requires k in advance:

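# Hierarchical clustering on Levenshtein (edit) distances

words <- c('foo', 'food', 'fo', 'ten', 'zen')  # avoid shadowing base::str
d <- adist(words)                # pairwise edit-distance matrix
rownames(d) <- words
hc <- hclust(as.dist(d))         # complete linkage by default
plot(hc)
rect.hclust(hc, k = 2)           # k has to be supplied here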

The clustering itself performs well, but k has to be known in advance to cut the tree.
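
For comparison, here is a minimal sketch of cutting the same tree at a distance threshold instead of at a fixed k, using base R's cutree(hc, h = ...). The threshold h = 2 is an assumption taken from "off by one or two letters", and the names words, max_edits and groups are illustrative only. This swaps k for a distance threshold rather than removing the tuning parameter altogether.

```r
# Sketch: derive the number of clusters from a maximum edit distance.
words <- c('foo', 'food', 'fo', 'ten', 'zen')
d <- adist(words)                    # pairwise Levenshtein distances
rownames(d) <- words
hc <- hclust(as.dist(d))             # complete linkage (the default)

max_edits <- 2                       # assumed threshold, not from the original post
groups <- cutree(hc, h = max_edits)  # cut by height, so no k is passed
n_clusters <- length(unique(groups)) # number of clusters implied by the cut

print(split(names(groups), groups))
print(n_clusters)
```

With complete linkage, every pair of words inside a cluster is at most max_edits apart, so the threshold has a direct interpretation in terms of edits.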

Is there a good algorithm for clustering words? Most of the documentation I've come across uses tf-idf and pairwise distances between sentences, but this is a much simpler problem, and nothing I have found addresses clustering groups of single words without knowing k. Any suggestions would be appreciated!

the_darkside