38

I am attempting to apply k-means on a set of high-dimensional data points (about 50 dimensions) and was wondering if there are any implementations that find the optimal number of clusters.

I remember reading somewhere that the way an algorithm generally does this is such that the inter-cluster distance is maximized and the intra-cluster distance is minimized, but I don't remember where I saw that. It would be great if someone could point me to any resources that discuss this. I am using SciPy for k-means currently, but any related library would be fine as well.

If there are alternative ways of achieving the same thing, or a better algorithm, please let me know.

Legend
  • 104,480
  • 109
  • 255
  • 385
  • This might be more appropriate for the [Theoretical Computer Science Stack Exchange](http://cstheory.stackexchange.com/), since it is not a question about implementation so much as theory. – gotgenes Jul 07 '11 at 19:09
  • 2
    ...and http://stackoverflow.com/questions/6353537/k-means-algorithm and http://stackoverflow.com/questions/6212690/how-to-optimal-k-in-k-means-algorithm. This question gets asked quite a lot. – Stompchicken Jul 08 '11 at 10:01
  • I've answered a similar Q with half a dozen methods (using `R`) over here: stackoverflow.com/a/15376462/1036500 – Ben May 13 '13 at 04:53
  • Maybe you should find cluster centers with subtractive clustering? The basic concept of this algorithm is presented in: [link](http://www.mathworks.com/help/fuzzy/subclust.html). It is for MATLAB but should be good enough. – Bartek S Jul 08 '13 at 14:54

7 Answers

16

One approach is cross-validation.

In essence, you pick a subset of your data, cluster it into k clusters, and ask how well the result holds up against the rest of the data: are the held-out points assigned to the same cluster memberships, or do they fall into different clusters?

If the memberships are roughly the same, the data fit well into k clusters. Otherwise, you try a different k.
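
For a concrete starting point, here is a minimal sketch of that stability check, assuming scikit-learn's `adjusted_rand_score` for comparing labelings (the data below is a random placeholder): fit k-means on one half, transfer its centroids to the held-out half, and compare against clustering the held-out half directly.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
data = rng.normal(size=(400, 50))              # placeholder for your points
half = len(data) // 2
train, test = data[:half], data[half:]

for k in range(2, 8):
    centroids, _ = kmeans2(train, k, minit="++")
    transferred, _ = vq(test, centroids)       # memberships via train centroids
    _, direct = kmeans2(test, k, minit="++")   # memberships from test alone
    # agreement close to 1 suggests the data fit well into k clusters
    print(k, adjusted_rand_score(transferred, direct))
```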

Also, you could do PCA (principal component analysis) to reduce your 50 dimensions to some more tractable number. If a PCA run suggests that most of your variance comes from, say, 4 out of the 50 dimensions, then you can pick k on that basis and explore how the four cluster memberships are assigned.
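
A quick sketch of the PCA step, assuming scikit-learn's `PCA` (SciPy's `scipy.linalg.svd` would work too), just to see where the explained variance flattens out:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(400, 50))              # placeholder for your points

pca = PCA(n_components=10).fit(data)
print(pca.explained_variance_ratio_.cumsum())  # look for where this flattens
reduced = pca.transform(data)[:, :4]           # keep, say, the first 4 components
```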

Alex Reynolds
  • 91,635
  • 50
  • 223
  • 320
  • 7
    What is the link between the number of dimensions and the number of clusters? I can easily build a 1 dimensional distribution with k clusters for arbitrary K. – Rob Neuhaus Jul 08 '11 at 19:36
  • 3
    "If the memberships are roughly the same" -- this assumes the data is divided *evenly* into clusters, which is quite a strong assumption. – Fred Foo Jul 08 '11 at 23:54
  • What do you mean by "the same cluster memberships"? Do you compare the clustering on the training folds with the clustering on the test fold? If so, I'm not sure how you can compare them, since they have completely non-overlapping data points. – max May 17 '16 at 06:07
9

Take a look at this Wikipedia page on [determining the number of clusters in a data set](http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set).

Also, you might want to try agglomerative hierarchical clustering. This approach does not need to know the number of clusters; it incrementally merges clusters of clusters until only one remains. This technique is also available in SciPy (`scipy.cluster.hierarchy`).
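
For example, a minimal sketch with `scipy.cluster.hierarchy` (the data and the distance threshold here are placeholders): build the merge tree once, then cut it at a distance threshold instead of fixing k in advance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 50))                   # stand-in for your 50-D points

Z = linkage(data, method="ward")                    # agglomerative merge tree
labels = fcluster(Z, t=10.0, criterion="distance")  # cut the tree at distance 10
print("clusters found:", len(np.unique(labels)))
```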

Kevin Jalbert
  • 2,815
  • 3
  • 25
  • 34
4

One interesting approach is that of evidence accumulation by Fred and Jain. This is based on combining multiple runs of k-means with a large number of clusters, aggregating them into an overall solution. Nice aspects of the approach include that the number of clusters is determined in the process and that the final clusters don't have to be spherical.
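
A rough sketch of the idea (not Fred and Jain's reference implementation; the run count, k, and cut threshold below are arbitrary choices): accumulate a co-association matrix over many k-means runs, then cluster that matrix hierarchically.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))       # placeholder data

n = len(data)
coassoc = np.zeros((n, n))
n_runs, k = 30, 25                      # many runs, deliberately large k
for _ in range(n_runs):
    _, labels = kmeans2(data, k, minit="++")
    coassoc += labels[:, None] == labels[None, :]
coassoc /= n_runs                       # fraction of runs each pair co-clusters

# treat 1 - co-association as a distance; cutting single linkage at 0.5
# keeps pairs together only if they co-clustered in a majority of runs
Z = linkage(squareform(1.0 - coassoc, checks=False), method="single")
final = fcluster(Z, t=0.5, criterion="distance")
print("clusters:", len(np.unique(final)))
```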

Michael J. Barber
  • 22,744
  • 8
  • 61
  • 84
1

There are visualizations that can hint at good parameters. For k-means, you could visualize several runs with different k using Graphgrams (see the WEKA graphgram package, best obtained via the package manager or here). An introduction and examples can also be found here.

0

If the cluster number is unknown, why not use hierarchical clustering instead?

At the beginning, every point is its own cluster; then any two clusters are merged if their distance is below a threshold, and the algorithm ends when no more merges are possible.

Hierarchical clustering can thus yield a suitable k for your data.

Hamid Rohani
  • 1,370
  • 2
  • 21
  • 31
Luna_one
  • 31
  • 3
0

One way to do it is to run k-means with a large k (much larger than what you think the correct number is), say 1000. Then run the mean-shift algorithm on these 1000 points (mean shift uses the whole data, but you will only "move" these 1000 points). Mean shift will then find the number of clusters. Running mean shift without k-means first is possible, but it is usually just too slow, O(N^2 * #steps), so running k-means first speeds things up to O(N * K * #steps).
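
A hedged sketch of this two-stage idea, assuming scikit-learn's `MeanShift` (SciPy has no mean shift) and placeholder data: k-means compresses the data to many centroids, and mean shift then merges them into the final clusters.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
data = rng.normal(size=(10000, 50))             # stand-in for your points

centroids, _ = kmeans2(data, 1000, minit="++")  # deliberately too many clusters
bw = estimate_bandwidth(data, n_samples=500)    # kernel width for mean shift
ms = MeanShift(bandwidth=bw, seeds=centroids).fit(data)
print("clusters found:", len(np.unique(ms.labels_)))
```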

tal
  • 764
  • 5
  • 17
0

You should also make sure that each dimension is in fact independent. Many so-called multi-dimensional datasets have multiple representations of the same thing.

It is not wrong to have these in your data. It is wrong to use multiple versions of the same thing as support for a cluster argument.

http://en.wikipedia.org/wiki/Cronbach%27s_alpha
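
For illustration, a minimal hand-rolled sketch of Cronbach's alpha from its textbook formula (not any particular library's API); a high alpha across a group of columns suggests they measure the same underlying thing:

```python
import numpy as np

def cronbach_alpha(items):
    """items: array of shape (n_observations, n_items)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the row sums
    return k / (k - 1) * (1.0 - item_vars / total_var)

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
noisy_copies = base + 0.1 * rng.normal(size=(200, 3))  # redundant dimensions
print(round(cronbach_alpha(noisy_copies), 3))          # close to 1 -> redundant
```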

Michael
  • 557
  • 3
  • 13