
I want to use K-Means Clustering for my dataset. I am using the kmeans() function in R for doing this.

 library(fpc)                  # provides plotcluster()
 k <- kmeans(data, centers = 3)
 plotcluster(data, k$cluster)  # first argument is the data itself

However, I am not sure what the correct value of k is for this function. I want to try the Elbow method for this. Are there any packages in R that perform clustering using the Elbow method to find the optimum number of clusters?

user1946152

  • From the wikipedia article: "This "elbow" cannot always be unambiguously identified." I think there is some subjectivity in this method, which makes an implementation difficult. – Roland Aug 04 '13 at 12:14
  • possible duplicate of [Cluster analysis in R: determine the optimal number of clusters](http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters) – Waldir Leoncio Apr 10 '14 at 18:37

5 Answers


There are two questions mixed up here. One is how to find a change point on a curve, and the other is how to quantify the quality of fit when using k-means to classify data. However, the cluster-analysis folks tend to lump these two questions together. Don't be afraid of looking into other curve-fit / change-point methods using whichever fit metric seems most appropriate to your case.

I know the 'elbow' method you linked to is a specific method, but you might be interested in something similar that looks for the 'knee' in the Bayesian Information Criterion (BIC). The kink in BIC versus the number of clusters (k) is the point at which you can argue that increasing BIC by adding more clusters is no longer beneficial, given the extra computational requirements of the more complex solution. There is a nice method that detects the optimum number of clusters from the change in sign of the second derivative of the BIC. See e.g.

Zhao, Q., V. Hautamaki, and P. Franti 2008a: Knee point detection in BIC for detecting the number of clusters. Advanced Concepts for Intelligent Vision Systems, J. Blanc-Talon, S. Bourennane, W. Philips, D. Popescu, and P. Scheunders, Eds., Springer Berlin / Heidelberg, Lecture Notes in Computer Science, Vol. 5259, 664–673, doi:10.1007/978-3-540-88458-3 60.

Zhao, Q., M. Xu, and P. Franti, 2008b: Knee point detection on bayesian information criterion. Tools with Artificial Intelligence, 2008. ICTAI ’08. 20th IEEE Inter- national Conference on, Vol. 2, 431 –438, doi:10.1109/ ICTAI.2008.154
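A minimal sketch of the idea, not the papers' exact algorithm: compute a criterion curve over k and take the sharpest bend in its discrete second derivative. Here the within-cluster sum of squares on the built-in iris data stands in for the BIC, and `knee_by_second_diff` is an illustrative name.

```r
# Simplified knee detection in the spirit of Zhao et al. (2008):
# the knee is where the second difference of the criterion is largest.
knee_by_second_diff <- function(crit) {
  d2 <- diff(crit, differences = 2)  # second difference, covers k = 2..(K-1)
  which.max(abs(d2)) + 1             # index of the sharpest bend
}

set.seed(1)
crit <- sapply(1:10, function(k)
  kmeans(scale(iris[, 1:4]), centers = k, nstart = 10)$tot.withinss)
knee_by_second_diff(crit)            # candidate number of clusters
```

The papers use the change of sign of the second derivative of the BIC itself; swapping in a different criterion changes where the knee lands.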

You might be interested in an automated application of this to weather data, reported in http://journals.ametsoc.org/doi/abs/10.1175/JAMC-D-11-0227.1

See also Finding the best trade-off point on a curve for an excellent discussion of the general approach.
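That discussion's approach can be sketched as follows: normalise the curve to the unit square and take the point farthest from the chord joining its endpoints. `dist_to_chord` is an illustrative name, not from any package.

```r
# Distance of each curve point to the straight line joining the
# endpoints; after normalisation that line is y = x.
dist_to_chord <- function(x, y) {
  x0 <- (x - x[1]) / (x[length(x)] - x[1])
  y0 <- (y - y[1]) / (y[length(y)] - y[1])
  abs(y0 - x0) / sqrt(2)
}

set.seed(1)
k   <- 1:10
wss <- sapply(k, function(i)
  kmeans(scale(iris[, 1:4]), centers = i, nstart = 10)$tot.withinss)
k[which.max(dist_to_chord(k, wss))]  # candidate elbow
```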

One final observation: make sure that you are consistent in your logarithms. Different communities use different notation, and this can be a source of error when comparing results.

Andy Clifton

Yes, you can find the best number of clusters using the Elbow method, but I found it troublesome to extract the number of clusters from the elbow graph in a script. You can look at the elbow graph and find the elbow point yourself, but it was a lot of work finding it from a script.

So another option is to use the Silhouette method. The results from the Silhouette method agreed completely with the results from the Elbow method.

Here's what I did.

#Dataset for Clustering
n = 150
g = 6 
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), 
                y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
mydata<-d
#Plot a 3x2 grid of plots
par(mfrow=c(3,2))

#Plot the original dataset
plot(mydata$x,mydata$y,main="Original Dataset")

#Scree plot to determine the number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
  for (i in 2:15) {
    wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
}   
plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")

# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method = "ward.D2")    # "ward" was renamed "ward.D2" in R >= 3.1.0
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")

#Silhouette analysis for determining the number of clusters
library(cluster) # pam() comes from the cluster package
asw <- numeric(20)
for (k in 2:20)
  asw[k] <- pam(mydata, k)$silinfo$avg.width
k.best <- which.max(asw)

cat("silhouette-optimal number of clusters:", k.best, "\n")
plot(pam(d, k.best)) # d is the distance matrix computed above

# K-Means Cluster Analysis
fit <- kmeans(mydata, k.best)
# get cluster means 
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, clusterid=fit$cluster)
plot(mydata$x,mydata$y, col = fit$cluster, main="K-means Clustering results")

Hope it helps!!

Udeep Shakya

  • Some illustrations of the results and a little more explanation of the thought process would turn this comprehensive answer into a great answer! – Andy Clifton Aug 05 '13 at 17:45

The GMD package provides an Elbow method for clustering evaluation; see page 7 of http://cran.r-project.org/web/packages/GMD/GMD.pdf

for a nice example see: http://www.inside-r.org/packages/cran/GMD/docs/elbow
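Based on the GMD manual linked above, usage looks roughly like the following. GMD has since been archived from CRAN, so treat the API (`css.hclust`, `elbow.batch`) as something to verify against your installed version.

```r
# Rough sketch following the GMD manual: compute the cumulative
# "explained variance" of a hierarchical clustering per k, then let
# elbow.batch() suggest the number of clusters.
library(GMD)
d <- dist(scale(iris[, 1:4]))                   # any distance matrix
css.obj <- css.hclust(d, hclust(d, "average"))  # explained variance per k
elbow.obj <- elbow.batch(css.obj)
elbow.obj$k                                     # suggested number of clusters
```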

hth

ben

holzben

My experience is that you can't automate this: you need to make the plot and check for the elbow. Here are some nice examples: Cluster analysis in R: determine the optimal number of clusters

stackoverflax

  • I tried doing that..but the graph i got is very different from the one in the wikipedia article...it is very difficult to find out the elbow point from it – user1946152 Aug 04 '13 at 18:02
  • you can automate anything - the issue is how reliably that automation finds the knee :) – Andy Clifton Aug 04 '13 at 20:38
  • Hi again. If the elbow isn't obvious in the graph than that's really an indication that there isn't one "right" answer for the number of clusters, k. You can try other metrics (AIC/BIC) or other clustering methods. Bottom-line may be, however, that you need a non-statistical method for choosing k (e.g. subject-matter expertise). (p.s. go ahead and post the graph if it's something you can share!) – stackoverflax Aug 06 '13 at 21:51

I tried the GMD package. For some reason it took a lot of time and came back with an error at this step:

Error in elbow.batch(css.obj) : A good `k' is not available with provided `inc.thres' and `ev.thres'; please make adjustment, e.g. decrease `ev.thres', increase `inc.thres' or increase `k'.

So this is my solution for getting the optimum k. (It's running on a 54-column dataset (all integers) with 6000+ rows.)

mydata <- read.csv(file.choose(), header = TRUE)
mydata_scaled <- as.data.frame(lapply(mydata, scale))
mss <- (nrow(mydata_scaled)-1)*sum(apply(mydata_scaled,2,var))
for (i in 2:15) mss[i] <- sum(kmeans(mydata_scaled,centers=i)$withinss)
plot(1:15, mss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

hope that helps

Jim G.
Sanjay Subramanian