K-means algorithm variation with equal cluster size

Question

I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce equally sized groups.

Is there a variation of this algorithm or a different one that allows for an equal count of members for all clusters?

See also: Group n points in k clusters of equal size

k-means clustering is NP-hard by itself. Maybe you can start changing the distance function till all points fall into equally sized groups, but I am afraid that it is not a convex optimization problem so you are up for some serious computation here. — Alexei Polkhanov, Mar 27 '11 at 21:40
Thanks everybody for their good answers. I have meanwhile taken a completely different approach for my initial problem, that doesn't involve clustering anymore. Thus I'm not able to judge which answer should be accepted, I'll just leave this open, hope you don't mind. — pixelistik, May 25 '12 at 09:32
@pixelistik Hi, could you please give an idea about the approach you have taken to solve this. I'm too trying to solve same problem. Any hint/suggestions will work. Thanks in advance. — Atendra, Feb 28 '18 at 10:58
Sorry, I can't be of help, I'm afraid. My approach does not involve clustering at all anymore. — pixelistik, Mar 01 '18 at 13:14
@Atendra there are implementations linked in the *many* answers below. Some seem to be outdated (python) others supposedly still work (ELKI), some you'd need to code yourself (my answer). Did you try any of them? — Has QUIT--Anony-Mousse, Jun 22 '18 at 07:06

Fred Foo · Answer 1 · 2011-03-27T22:36:29.370

16

This might do the trick: apply Lloyd's algorithm to get k centroids. Sort the centroids by descending size of their associated clusters in an array. For i = 1 through k-1, push the data points in cluster i with minimal distance to any other centroid j (i < j ≤ k) off to j and recompute the centroid i (but don't recompute the cluster) until the cluster size is n / k.

The complexity of this postprocessing step is O(k² n lg n).

edited Mar 27 '11 at 22:36

answered Mar 27 '11 at 21:49

Fred Foo

328,932
68
689
800

Thank you, this sound like a good idea for achieving equally sized groups in a second step. – pixelistik Mar 27 '11 at 22:52
2

I tried this solution without success, please see my related question http://stackoverflow.com/questions/8796682/group-n-points-in-k-clusters-of-equal-size – Pierre-David Belanger Jan 09 '12 at 23:52
1

Isn't Lloyd's algorithm on a discrete set the same as k-means? – Tom Lubitz Jul 21 '15 at 22:10

Erich Schubert · Answer 2 · 2019-02-19T20:52:19.663

The ELKI data mining framework has a tutorial on equal-size k-means.

This is not a particulary good algorithm, but it's an easy enough k-means variation to write a tutorial for and teach people how to implement their own clustering algorithm variation; and apparently some people really need their clusters to have the same size, although the SSQ quality will be worse than with regular k-means.

In ELKI 0.7.5, you can select this algorithm as tutorial.clustering.SameSizeKMeansAlgorithm.

score 5 · Answer 3 · answered Mar 28 '11 at 10:02

You can view the distances as defining a weighted graph. Quite a few graph partitioning algorithms are explicitly based on trying to partition the graph vertices into two sets of equal size. This appears in, for example, the Kernighan-Lin algorithm and in spectral graph partitioning using the Laplacian. To get multiple clusters, you can recursively apply the partitioning algorithm; there's a nice discussion of this at the link on spectral graph partitioning.

score 4 · Answer 4 · edited Jun 22 '18 at 07:00

In general, grouping points on a map into equally sized groups, by distance is a impossible mission in theory. Because grouping points into equally sized groups is in conflict with grouping points in clusters by distance.

see this plot: enter image description here

There are four points:

A.[1,1]
B.[1,2]
C.[2,2]
D.[5,5]

If we cluster these points into two cluster. Obviously, (A,B,C) will be cluster 1, D will be cluster 2. But if we need equally sized groups, (A,B) will be one cluster, (C,D) will be the other. This violates cluster rules because C is closer to center of (A,B) but it belongs to cluster (C,D). So requirement of cluster and equally sized groups can not be satisfied at the same time.

score 4 · Answer 5 · answered Jan 10 '12 at 20:42

Try this k-means variation:

Initialization:

choose k centers from the dataset at random, or even better using kmeans++ strategy
for each point, compute the distance to its nearest cluster center, and build a heap for this
draw points from the heap, and assign them to the nearest cluster, unless the cluster is already overfull. If so, compute the next nearest cluster center and reinsert into the heap

In the end, you should have a paritioning that satisfies your requirements of the +-1 same number of objects per cluster (make sure the last few clusters also have the right number. The first m clusters should have ceil objects, the remainder exactly floor objects.)

Iteration step:

Requisites: a list for each cluster with "swap proposals" (objects that would prefer to be in a different cluster).

E step: compute the updated cluster centers as in regular k-means

M step: Iterating through all points (either just one, or all in one batch)

Compute nearest cluster center to object / all cluster centers that are closer than the current clusters. If it is a different cluster:

If the other cluster is smaller than the current cluster, just move it to the new cluster
If there is a swap proposal from the other cluster (or any cluster with a lower distance), swap the two element cluster assignments (if there is more than one offer, choose the one with the largest improvement)
otherwise, indicate a swap proposal for the other cluster

The cluster sizes remain invariant (+- the ceil/floor difference), an objects are only moved from one cluster to another as long as it results in an improvement of the estimation. It should therefore converge at some point like k-means. It might be a bit slower (i.e. more iterations) though.

I do not know if this has been published or implemented before. It's just what I would try (if I would try k-means. there are much better clustering algorithms.)

score 3 · Answer 6 · answered Aug 14 '15 at 22:10

There is a cleaner post-processing, given cluster centroids. Let N be the number of items, K the number of clusters and S = ceil(N/K) maximum cluster size.

Create a list of tuples (item_id, cluster_id, distance)
Sort tuples with respect to distance
For each element (item_id, cluster_id, distance) in the sorted list of tuples:
- if number of elements in cluster_id exceeds S do nothing
- otherwise add item_id to cluster cluster_id.

This runs in O(NK lg(N)), should give comparable results to @larsmans solution and is easier to implement. In pseudo-python

dists = []
clusts = [None] * N
counts = [0] * K

for i, v in enumerate(items):
    dist = map( lambda x: dist(x, v), centroids )
    dd = map( lambda (k, v): (i, k, v), enumerate(dist) )
    dists.extend(dd)

dists = sorted(dists, key = lambda (x,y,z): z)

for (item_id, cluster_id, d) in dists:
    if counts[cluster_id] >= S:
        continue
    if clusts[item_id] == None:
        clusts[item_id] = cluster_id
        counts[cluster_id] = counts[cluster_id] + 1

score 3 · Answer 7 · edited Sep 25 '17 at 21:51

After reading this question and several similar ones, I created a python implementation of the same-size k-means using the Elki tutorial on https://elki-project.github.io/tutorial/same-size_k_means which utilizes scikit-learn's K-Means implementation for most of the common methods and familiar API.

My implementation is found here: https://github.com/ndanielsen/Same-Size-K-Means

The clustering logic is found in this function : _labels_inertia_precompute_dense()

user2341646 · Answer 8 · 2013-11-27T19:10:51.220

Recently I needed this myself for a not very large dataset. My answer, although it has a relatively long running time, is guaranteed to converge to a local optimum.

def eqsc(X, K=None, G=None):
    "equal-size clustering based on data exchanges between pairs of clusters"
    from scipy.spatial.distance import pdist, squareform
    from matplotlib import pyplot as plt
    from matplotlib import animation as ani    
    from matplotlib.patches import Polygon   
    from matplotlib.collections import PatchCollection
    def error(K, m, D):
        """return average distances between data in one cluster, averaged over all clusters"""
        E = 0
        for k in range(K):
            i = numpy.where(m == k)[0] # indeces of datapoints belonging to class k
            E += numpy.mean(D[numpy.meshgrid(i,i)])
        return E / K
    numpy.random.seed(0) # repeatability
    N, n = X.shape
    if G is None and K is not None:
        G = N // K # group size
    elif K is None and G is not None:
        K = N // G # number of clusters
    else:
        raise Exception('must specify either K or G')
    D = squareform(pdist(X)) # distance matrix
    m = numpy.random.permutation(N) % K # initial membership
    E = error(K, m, D)
    # visualization
    #FFMpegWriter = ani.writers['ffmpeg']
    #writer = FFMpegWriter(fps=15)
    #fig = plt.figure()
    #with writer.saving(fig, "ec.mp4", 100):
    t = 1
    while True:
        E_p = E
        for a in range(N): # systematically
            for b in range(a):
                m[a], m[b] = m[b], m[a] # exchange membership
                E_t = error(K, m, D)
                if E_t < E:
                    E = E_t
                    print("{}: {}<->{} E={}".format(t, a, b, E))
                    #plt.clf()
                    #for i in range(N):
                        #plt.text(X[i,0], X[i,1], m[i])
                    #writer.grab_frame()
                else:
                    m[a], m[b] = m[b], m[a] # put them back
        if E_p == E:
            break
        t += 1           
    fig, ax = plt.subplots()
    patches = []
    for k in range(K):
        i = numpy.where(m == k)[0] # indeces of datapoints belonging to class k
        x = X[i]        
        patches.append(Polygon(x[:,:2], True)) # how to draw this clock-wise?
        u = numpy.mean(x, 0)
        plt.text(u[0], u[1], k)
    p = PatchCollection(patches, alpha=0.5)        
    ax.add_collection(p)
    plt.show()

if __name__ == "__main__":
    N, n = 100, 2    
    X = numpy.random.rand(N, n)
    eqsc(X, G=3)

sclv · Answer 9 · 2011-03-27T22:18:34.567

2

Consider some form of recursive greedy merge -- each point begins as a singleton cluster and repeatedly merge the closest two such that such a merge doesn't exceed max. size. If you have no choice left but to exceed max size, then locally recluster. This is a form of backtracking hierarchical clustering: http://en.wikipedia.org/wiki/Hierarchical_clustering

edited Mar 27 '11 at 22:18

answered Mar 27 '11 at 22:12

sclv

37,544
6
89
194

1

Looks like a good start here. But, yeah, can you define "locally recluster"? Thank you. – Pierre-David Belanger Jan 10 '12 at 16:32

Eyal Shulman · Answer 10 · 2020-09-14T14:28:59.733

2

Just in case anyone wants to copy and paste a short function here you go - basically running KMeans then finding the minimal matching of points to clusters under the constraint of maximal points assigned to cluster (cluster size)

from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment
import numpy as np


def get_even_clusters(X, cluster_size):
    n_clusters = int(np.ceil(len(X)/cluster_size))
    kmeans = KMeans(n_clusters)
    kmeans.fit(X)
    centers = kmeans.cluster_centers_
    centers = centers.reshape(-1, 1, X.shape[-1]).repeat(cluster_size, 1).reshape(-1, X.shape[-1])
    distance_matrix = cdist(X, centers)
    clusters = linear_sum_assignment(distance_matrix)[1]//cluster_size
    return clusters

edited Sep 14 '20 at 14:28

answered Sep 14 '20 at 14:23

Eyal Shulman

504
3
11

I guess, X here is a list of (x,y) values or coordinates. Is there a way to feed in a distance/weight matrix instead? – tcokyasar Feb 23 '21 at 11:06
1

X is a set of vectors represented as a matrix. sklearn's KMeans has a deprecated precompute_distances parameter which I think is what you're after. You can generally replace the KMeans part to find clusters in the way that suits you and keep the assignment part to make them of equal size – Eyal Shulman Feb 24 '21 at 10:11

denis · Answer 11 · 2012-01-10T11:49:57.930

Added Jan 2012: Better than postprocessing would be to keep cluster sizes roughly the same as they grow.
For example, find for each X the 3 nearest centres, then add X to the one with the best distance - λ clustersize.

A simple greedy postprocess after k-means may be good enough, if your clusters from k-means are roughly equal-sized.
(This approximates an assignment algorithm on the Npt x C distance matrix from k-means.)

One could iterate

diffsizecentres = kmeans( X, centres, ... )
X_centre_distances = scipy.spatial.distance.cdist( X, diffsizecentres )
    # or just the nearest few centres
xtoc = samesizeclusters( X_centre_distances )
samesizecentres = [X[xtoc[c]].mean(axis=0) for c in range(k)]
...

I'd be surprised if iterations changed the centres much, but it'll depend ™.

About how big are your Npoint Ncluster and Ndim ?

#!/usr/bin/env python
from __future__ import division
from operator import itemgetter
import numpy as np

__date__ = "2011-03-28 Mar denis"

def samesizecluster( D ):
    """ in: point-to-cluster-centre distances D, Npt x C
            e.g. from scipy.spatial.distance.cdist
        out: xtoc, X -> C, equal-size clusters
        method: sort all D, greedy
    """
        # could take only the nearest few x-to-C distances
        # add constraints to real assignment algorithm ?
    Npt, C = D.shape
    clustersize = (Npt + C - 1) // C
    xcd = list( np.ndenumerate(D) )  # ((0,0), d00), ((0,1), d01) ...
    xcd.sort( key=itemgetter(1) )
    xtoc = np.ones( Npt, int ) * -1
    nincluster = np.zeros( C, int )
    nall = 0
    for (x,c), d in xcd:
        if xtoc[x] < 0  and  nincluster[c] < clustersize:
            xtoc[x] = c
            nincluster[c] += 1
            nall += 1
            if nall >= Npt:  break
    return xtoc

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    from scipy.spatial import distance
    # http://docs.scipy.org/doc/scipy/reference/spatial.distance.html

    Npt = 100
    C = 3
    dim = 3
    seed = 1

    exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
    np.set_printoptions( 2, threshold=200, edgeitems=5, suppress=True )  # .2f
    random.seed(seed)
    np.random.seed(seed)

    X = np.random.uniform( size=(Npt,dim) )
    centres = random.sample( X, C )
    D = distance.cdist( X, centres )
    xtoc = samesizecluster( D )
    print "samesizecluster sizes:", np.bincount(xtoc)
        # Npt=100 C=3 -> 32 34 34

No big numbers: Npoint=180; NCluster=Npoint/9=20; Ndim=2 (geographical map, 2D) — pixelistik, Mar 29 '11 at 11:37

score 1 · Answer 12 · answered May 22 '20 at 18:29

Equal size k-means is a special case of a constrained k-means procedure where each cluster must have a minimum number of points. This problem can be formulated as a graph problem where the nodes are the points to be clustered, and each point has an edge to each centroid, where the edge weight is the squared euclidean distance to the centroid. It is discussed here:

Bradley PS, Bennett KP, Demiriz A (2000), Constrained K-Means Clustering. Microsoft Research.

A Python implementation is available here.

score 0 · Answer 13 · answered Mar 19 '15 at 01:14

Also look at K-d tree which partitions the data until each partitions' members are less than a BUCKET_SIZE which is an input to the algorithm.

This doesn't force the buckets/partitions to be exactly the same size but they'll be all less than the BUCKET_SIZE.

score 0 · Answer 14 · answered Sep 27 '17 at 09:07

I've been struggling on how to solve this question too. However, I realize that I have used the wrong keyword all this time. If you want the number of point result member to be same size, you are doing a grouping, not clustering anymore. I finally able to solve the problem using simple python script and postgis query.

For example, I have a table called tb_points which has 4000 coordinate point, and you want to divide it into 10 same size group, which will contain 400 coordinate point each. Here is the example of the table structure

CREATE TABLE tb_points (
  id SERIAL PRIMARY KEY,
  outlet_id INTEGER,
  longitude FLOAT,
  latitide FLOAT,
  group_id INTEGER
);

Then what you need to do are:

Find the first coordinate that will be your starting point
Find nearest coordinate from your starting point, order by distance ascending, limit the result by the number of your preferred member (in this case 400)
Update the result by updating the group_id column
Do 3 steps above 10 times for the rest of data, which group_id column is still NULL

This is the implementation in python:

import psycopg2

dbhost = ''
dbuser = ''
dbpass = ''
dbname = ''
dbport = 5432

conn = psycopg2.connect(host = dbhost,
       user = dbuser,
       password = dbpass,
       database = dbname,
       port = dbport)

def fetch(sql):
    cursor = conn.cursor()
    rs = None
    try:
        cursor.execute(sql)
        rs = cursor.fetchall()
    except psycopg2.Error as e:
        print(e.pgerror)
        rs = 'error'
    cursor.close()
    return rs

def execScalar(sql):
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        conn.commit()
        rowsaffected = cursor.rowcount
    except psycopg2.Error as e:
        print(e.pgerror)
        rowsaffected = -1
        conn.rollback()
    cursor.close()
    return rowsaffected


def select_first_cluster_id():
    sql = """ SELECT a.outlet_id as ori_id, a.longitude as ori_lon,
    a.latitude as ori_lat, b.outlet_id as dest_id, b.longitude as
    dest_lon, b.latitude as dest_lat,
    ST_Distance(CAST(ST_SetSRID(ST_Point(a.longitude,a.latitude),4326)
    AS geography), 
    CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS geography))
    AS air_distance FROM  tb_points a CROSS JOIN tb_points b WHERE
    a.outlet_id != b.outlet_id and a.group_id is NULL and b.group_id is
    null order by air_distance desc limit 1 """
    return sql

def update_group_id(group_id, ori_id, limit_constraint):
    sql = """ UPDATE tb_points
    set group_id = %s
    where outlet_id in
    (select b.outlet_id
    from tb_points a,
    tb_points b
    where a.outlet_id = '%s'
    and a.group_id is null
    and b.group_id is null
    order by ST_Distance(CAST(ST_SetSRID(ST_Point(a.longitude,a.latitude),4326) AS geography),
    CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS geography)) asc
    limit %s)
    """ % (group_id, ori_id, limit_constraint)
    return sql

def clustering():
    data_constraint = [100]
    n = 1
    while n <= 10:
        sql = select_first_cluster_id()
        res = fetch(sql)
        ori_id = res[0][0]

        sql = update_group_id(n, ori_id, data_constraint[0])
        print(sql)
        execScalar(sql)

        n += 1

clustering()

Hope it helps

Pierre-David Belanger · Answer 15 · 2016-12-20T19:57:29.503

0

May I humbly suggest that you try this project ekmeans.

A Java K-means Clustering implementation with an optional special equal option that apply an equal cardinality constraint on the clusters while remaining as spatially cohesive as possible.

It is yet experimental, so just be aware of the known bugs.

edited Dec 20 '16 at 19:57

answered Jan 12 '12 at 17:45

Pierre-David Belanger

994
1
9
18

Gigamegs · Answer 16 · 2019-12-12T02:33:21.487

-1

You want to take a look into space-filling-curve, for example a z-curve or a hilbert-curve. You can think of a space-filling-curve to reduce the 2-Dimensional problem to a 1-Dimensional problem. Although the sfc index is only a reorder of the 2-Dimensional data and not a perfect clustering of the data it can be useful when the solution has not to satisfied a perfect cluster and has to be compute fairly fast.

edited Dec 12 '19 at 02:33

answered Mar 27 '11 at 21:51

Gigamegs

12,342
7
31
71

Could you explain how to use a space filling curve to solve the problem? – Dr. belisarius Mar 28 '11 at 02:39

K-means algorithm variation with equal cluster size

16 Answers16

Linked

Related