
On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in kmeans method of SciPy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated.

More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph.
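(Equivalently, if SS_total is the total sum of squares of the data and SS_within(k) is the within-cluster sum of squares for k clusters, the quantity plotted is 100 * (SS_total - SS_within(k)) / SS_total; the digits example in the first answer below computes it exactly this way.)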

Assuming that I have the following points with their associated centroids, what is a good way of calculating this measure?

import numpy
from scipy.cluster.vq import kmeans

points = numpy.array([[ 0,  0],
       [ 0,  1],
       [ 0, -1],
       [ 1,  0],
       [-1,  0],
       [ 9,  9],
       [ 9, 10],
       [ 9,  8],
       [10,  9],
       [10,  8]])

kmeans(points, 2)
(array([[9, 8],
   [0, 0]]), 0.9414213562373096)

I am specifically looking at computing the 0.94.. measure given just the points and the centroids. I am not sure whether any of the built-in methods of scipy can be used or whether I have to write my own. Any suggestions on how to do this efficiently for a large number of points?

In short, my questions (all related) are the following:

  • Given a distance matrix and a mapping of which point belongs to which cluster, what is a good way of computing a measure that can be used to draw the elbow plot?
  • How would the methodology change if a different distance function such as cosine similarity is used?

EDIT 2: Distortion

from scipy.spatial.distance import cdist

# centroids as returned by kmeans(points, 2) above
D = cdist(points, centroids, 'euclidean')   # distance from every point to every centroid
sum(numpy.min(D, axis=1))                   # sum of the distances to the closest centroid

The output for the first set of points is accurate. However, when I try a different set:

>>> pp = numpy.array([[1,2], [2,1], [2,2], [1,3], [6,7], [6,5], [7,8], [8,8]])
>>> kmeans(pp, 2)
(array([[6, 7],
       [1, 2]]), 1.1330618877807475)
>>> centroids = numpy.array([[6,7], [1,2]])
>>> D = cdist(pp, centroids, 'euclidean')
>>> sum(numpy.min(D, axis=1))
9.0644951022459797

The last value does not match because kmeans divides the summed distances by the total number of points, i.e. it returns the mean distance to the closest centroid: 9.0645 / 8 = 1.1331.
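For example, reusing the arrays above, the mean of the per-point minimum distances reproduces the value kmeans returned:

D = cdist(pp, centroids, 'euclidean')
numpy.min(D, axis=1).sum() / pp.shape[0]   # 1.1330618877807475, the value kmeans(pp, 2) returned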

EDIT 1: Percent Variance

My code so far (should be added to Denis's K-means implementation):

# kmeanssample, kmdelta, kmiter and metric come from Denis's k-means code linked in his answer
centres, xtoc, dist = kmeanssample( points, 2, nsample=2,
        delta=kmdelta, maxiter=kmiter, metric=metric, verbose=0 )

print "Unique clusters: ", set(xtoc)
print ""
cluster_vars = []
for cluster in set(xtoc):
    print "Cluster: ", cluster

    truthcondition = [x == cluster for x in xtoc]       # boolean mask: is point i in this cluster?
    distances_inside_cluster = truthcondition * dist    # zero out distances belonging to other clusters

    indices = [i for i, x in enumerate(truthcondition) if x]
    final_distances = [distances_inside_cluster[k] for k in indices]

    print final_distances
    print np.array(final_distances).var()
    cluster_vars.append(np.array(final_distances).var())
    print ""

print "Sum of variances: ", sum(cluster_vars)
print "Total Variance: ", points.var()
print "Percent: ", (100 * sum(cluster_vars) / points.var())

And following is the output for k=2:

Unique clusters:  set([0, 1])

Cluster:  0
[1.0, 2.0, 0.0, 1.4142135623730951, 1.0]
0.427451660041

Cluster:  1
[0.0, 1.0, 1.0, 1.0, 1.0]
0.16

Sum of variances:  0.587451660041
Total Variance:  21.1475
Percent:  2.77787757437

On my real dataset (does not look right to me!):

Sum of variances:  0.0188124746402
Total Variance:  0.00313754329764
Percent:  599.592510943
Unique clusters:  set([0, 1, 2, 3])

Sum of variances:  0.0255808508714
Total Variance:  0.00313754329764
Percent:  815.314672809
Unique clusters:  set([0, 1, 2, 3, 4])

Sum of variances:  0.0588210052519
Total Variance:  0.00313754329764
Percent:  1874.74720416
Unique clusters:  set([0, 1, 2, 3, 4, 5])

Sum of variances:  0.0672406353655
Total Variance:  0.00313754329764
Percent:  2143.09824556
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6])

Sum of variances:  0.0646291452839
Total Variance:  0.00313754329764
Percent:  2059.86465055
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7])

Sum of variances:  0.0817517362176
Total Variance:  0.00313754329764
Percent:  2605.5970695
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8])

Sum of variances:  0.0912820650486
Total Variance:  0.00313754329764
Percent:  2909.34837831
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Sum of variances:  0.102119601368
Total Variance:  0.00313754329764
Percent:  3254.76309585
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Sum of variances:  0.125549475536
Total Variance:  0.00313754329764
Percent:  4001.52168834
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

Sum of variances:  0.138469402779
Total Variance:  0.00313754329764
Percent:  4413.30651542
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
Legend

2 Answers


The distortion, as far as k-means is concerned, is used as a stopping criterion: if the change between two iterations is less than some threshold, we assume convergence.
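For illustration only, here is a minimal sketch of such a convergence check (this is not SciPy's actual implementation; it assumes the `points` array and an initial `centroids` array like those in the question, and does not handle empty clusters):

import numpy as np
from scipy.spatial.distance import cdist

prev_distortion = np.inf
thresh = 1e-5
while True:
    D = cdist(points, centroids, 'euclidean')
    labels = D.argmin(axis=1)                    # closest centroid for every point
    distortion = D.min(axis=1).mean()            # mean distance to the closest centroid
    if prev_distortion - distortion <= thresh:   # change small enough -> stop
        break
    prev_distortion = distortion
    centroids = np.array([points[labels == j].mean(axis=0)   # recompute each centroid
                          for j in range(len(centroids))])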

If you want to calculate it from a set of points and the centroids, you can do the following (the code is in MATLAB, using the pdist2 function, but it should be straightforward to rewrite in Python/NumPy/SciPy):

% data
X = [0 0 ; 0 1 ; 0 -1 ; 1 0 ; -1 0 ; 9 9 ; 9 10 ; 9 8 ; 10 9 ; 10 8];

% centroids
C = [9 8 ; 0 0];

% euclidean distance from each point to each cluster centroid
D = pdist2(X, C, 'euclidean');

% find closest centroid to each point, and the corresponding distance
[distortions,idx] = min(D,[],2);

the result:

% total distortion
>> sum(distortions)
ans =
           9.4142135623731
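For reference, the same computation in Python/NumPy/SciPy (a direct translation of the MATLAB snippet, with cdist playing the role of pdist2):

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0, 0], [0, 1], [0, -1], [1, 0], [-1, 0],
              [9, 9], [9, 10], [9, 8], [10, 9], [10, 8]])
C = np.array([[9, 8], [0, 0]])

D = cdist(X, C, 'euclidean')       # distance from each point to each centroid
idx = D.argmin(axis=1)             # index of the closest centroid
distortions = D.min(axis=1)        # distance to the closest centroid

print(distortions.sum())           # total distortion, ~9.4142
print(distortions.sum() / len(X))  # per-point distortion, ~0.9414, which is what kmeans returns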

EDIT#1:

I had some time to play around with this. Here is an example of k-means clustering applied to the Fisher Iris dataset (4 features, 150 instances). We iterate over k=1..10, plot the elbow curve, pick K=3 as the number of clusters, and show a scatter plot of the result.

Note that I included a number of ways to compute the within-cluster distortions, given the points and the centroids. The scipy.cluster.vq.kmeans function returns this measure by default (computed with Euclidean as the distance measure). You can also use the scipy.spatial.distance.cdist function to calculate the distances with the metric of your choice (provided you obtained the cluster centroids using the same distance measure: @Denis has a solution for that), then compute the distortion from those distances.

import numpy as np
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

# load the iris dataset
fName = 'C:\\Python27\\Lib\\site-packages\\scipy\\spatial\\tests\\data\\iris.txt'
fp = open(fName)
X = np.loadtxt(fp)
fp.close()

##### cluster data into K=1..10 clusters #####
K = range(1,11)

# scipy.cluster.vq.kmeans
KM = [kmeans(X,k) for k in K]
centroids = [cent for (cent,var) in KM]   # cluster centroids
#avgWithinSS = [var for (cent,var) in KM] # distortion: mean distance to the closest centroid

# alternative: scipy.cluster.vq.vq
#Z = [vq(X,cent) for cent in centroids]
#avgWithinSS = [sum(dist)/X.shape[0] for (cIdx,dist) in Z]

# alternative: scipy.spatial.distance.cdist
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]  # mean distance to the closest centroid

##### plot ###
kIdx = 2

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12, 
    markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster distance (distortion)')
plt.title('Elbow for KMeans clustering')

# scatter plot
fig = plt.figure()
ax = fig.add_subplot(111)
#ax.scatter(X[:,2],X[:,1], s=30, c=cIdx[k])
clr = ['b','g','r','c','m','y','k']
for i in range(K[kIdx]):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,2],X[ind,1], s=30, c=clr[i], label='Cluster %d'%i)
plt.xlabel('Petal Length')
plt.ylabel('Sepal Width')
plt.title('Iris Dataset, KMeans clustering with K=%d' % K[kIdx])
plt.legend()

plt.show()

[images: elbow_curve, scatter_plot]


EDIT#2:

In response to the comments, I give below another complete example using the hand-written digits dataset that ships with scikit-learn (originally from NIST): it has 1797 images of digits from 0 to 9, each of size 8-by-8 pixels. I repeat the experiment above, slightly modified: Principal Component Analysis is applied to reduce the dimensionality from 64 down to 2:

import numpy as np
from scipy.cluster.vq import kmeans
from scipy.spatial.distance import cdist,pdist
from sklearn import datasets
from sklearn.decomposition import RandomizedPCA  # removed in newer scikit-learn; use PCA(svd_solver='randomized')
from matplotlib import pyplot as plt
from matplotlib import cm

##### data #####
# load digits dataset
data = datasets.load_digits()
t = data['target']

# perform PCA dimensionality reduction
pca = RandomizedPCA(n_components=2).fit(data['data'])
X = pca.transform(data['data'])

##### cluster data into K=1..20 clusters #####
K_MAX = 20
KK = range(1,K_MAX+1)

KM = [kmeans(X,k) for k in KK]
centroids = [cent for (cent,var) in KM]
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]

tot_withinss = [sum(d**2) for d in dist]  # Total within-cluster sum of squares
totss = sum(pdist(X)**2)/X.shape[0]       # The total sum of squares
betweenss = totss - tot_withinss          # The between-cluster sum of squares

##### plots #####
kIdx = 9        # K=10
clr = cm.spectral( np.linspace(0,1,10) ).tolist()   # the 'spectral' colormap is 'nipy_spectral' in newer matplotlib
mrk = 'os^p<dvh8>+x.'

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(KK, betweenss/totss*100, 'b*-')
ax.plot(KK[kIdx], betweenss[kIdx]/totss*100, marker='o', markersize=12, 
    markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Percentage of variance explained (%)')
plt.title('Elbow for KMeans clustering')

# show centroids for K=10 clusters
plt.figure()
for i in range(kIdx+1):
    img = pca.inverse_transform(centroids[kIdx][i]).reshape(8,8)
    ax = plt.subplot(3,4,i+1)
    ax.set_xticks([])
    ax.set_yticks([])
    plt.imshow(img, cmap=cm.gray)
    plt.title( 'Cluster %d' % i )

# compare K=10 clustering vs. actual digits (PCA projections)
fig = plt.figure()
ax = fig.add_subplot(121)
for i in range(10):
    ind = (t==i)
    ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='%d'%i)
plt.legend()
plt.title('Actual Digits')
ax = fig.add_subplot(122)
for i in range(kIdx+1):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='C%d'%i)
plt.legend()
plt.title('K=%d clusters'%KK[kIdx])

plt.show()

[images: elbow_curve, digits_centroids, PCA_compare]

You can see how some clusters actually correspond to distinguishable digits, while others don't match a single number.

Note: An implementation of K-means is included in scikit-learn (as well as many other clustering algorithms and various clustering metrics). Here is another similar example.
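For instance, a minimal sketch of the elbow plot with scikit-learn's own KMeans (its inertia_ attribute is the total within-cluster sum of squares; the random data here is only a placeholder):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)                 # placeholder data
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, 'b*-')
plt.xlabel('Number of clusters')
plt.ylabel('Total within-cluster sum of squares (inertia)')
plt.title('Elbow for KMeans clustering')
plt.show()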

Amro
  • +1 Thank you for your explanation. From what you mentioned, the only point of confirmation I am looking for now is whether this distortion value is used for assessing the value of `k`. In the post here: http://stats.stackexchange.com/questions/9850/how-to-plot-data-output-of-clustering the author directly uses distortion but I could not really understand why he did that. Would you have any thoughts on this? – Legend Jul 11 '11 at 23:02
  • yes, there is a trade-off between minimizing the total within-cluster sum of squares (called *distortion* here) and minimizing the number of clusters. In other words we want a model that fits the data well (small distortion), but at the same time, we want the model to be as simple as possible (not complex with too many clusters). The elbow method is a simple heuristic to balance between the two. This answer also explains it well: http://stackoverflow.com/questions/1793532/how-do-i-determine-k-when-using-k-means-clustering/1793572#1793572 – Amro Jul 11 '11 at 23:24
  • Amro, nice. However Iris is tiny, extrapolating from it iffy. Running kmeans on the 1797 x 64 digits data from [scikits.learn](http://scikit-learn.sourceforge.net/modules/datasets.html), which ought to have 10 well-separated clusters :) I get for k = 7 .. 13: average distance point - cluster centre 27.7 26.2 25.3 26.2 24.6 24.5 24.1 . Knee at 10 ? – denis Jul 12 '11 at 17:54
  • @Denis: I added another example with the hand-written digits dataset – Amro Jul 13 '11 at 23:56
  • Amro, nice again, +1 (add to the scikits.learn examples ?) This shows clearly that k = 10 isn't much of an elbow, automatic elbowing is iffy. – denis Jul 14 '11 at 13:35
  • 1
    @Denis: The elbow method is a **heuristic** approach far from being perfect. Other methods exist such as AIC/BIC... Also you have to remember Kmeans is an **unsupervised** learning technique, meaning it has no idea what the actual classes of the data are. Instead it tries to naturally discover the clusters from the data itself. So if two digits look alike in the feature space, they might be grouped together as you saw in the example above. Also by using PCA, we have lost some information in favor of fewer dimensions... As you may have found out by now, clustering is a difficult task :) – Amro Jul 14 '11 at 13:53

A simple cluster measure:
1) draw "sunburst" rays from each point to its nearest cluster centre,
2) look at the lengths — distance( point, centre, metric=... ) — of all the rays.

For metric="sqeuclidean" and 1 cluster, the average length-squared is the total variance X.var(); for 2 clusters, it's less ... down to N clusters, lengths all 0. "Percent of variance explained" is 100 % - this average.

Code for this, under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means:

from scipy.spatial.distance import cdist

def distancestocentres( X, centres, metric="euclidean", p=2 ):
    """ all distances X -> nearest centre, any metric
            euclidean2 (~ withinss) is more sensitive to outliers,
            cityblock (manhattan, L1) less sensitive
    """
    D = cdist( X, centres, metric=metric, p=p )  # |X| x |centres|
    return D.min(axis=1)  # all the distances
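
For example, a sketch of turning these distances into the "percent of variance explained" curve described above (it assumes metric="sqeuclidean", so the returned values are already squared, and a hypothetical list allcentres holding one centres array per candidate k, e.g. from repeated kmeanssample runs):

import numpy as np

totalvar = X.var(axis=0).sum()   # per-feature variances summed = mean squared distance to the overall mean

percent_explained = []
for centres in allcentres:       # X and allcentres are assumed to exist, see above
    d2 = distancestocentres( X, centres, metric="sqeuclidean" )
    percent_explained.append( 100 * (1 - d2.mean() / totalvar) )
# plot k versus percent_explained and look for the elbow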

Like any long list of numbers, these distances can be looked at in various ways: np.mean(), np.histogram() ... Plotting and visualization are not easy.
See also stats.stackexchange.com/questions/tagged/clustering, in particular
How to tell if data is “clustered” enough for clustering algorithms to produce meaningful results?

denis
  • +1 Thank you for your time and explanation! I made an attempt at coding what you explained in your post and added it at the end of my question. Could you please take a look at it when you have some spare time? – Legend Jul 11 '11 at 22:58
  • Sure, good enough. The real question is, how does that vary with k for your real data -- numbers please ? If k = say 5 and 6 are close, move on. – denis Jul 12 '11 at 13:40
  • I am guessing there is something wrong in my function. I have posted the observed values in my question below the function under EDIT 1. The percentages I am getting are exceeding 100% and going to thousands. I guess now I am pretty sure my implementation is wrong. – Legend Jul 12 '11 at 19:02