
I am working with GPS data (latitude, longitude). For density-based clustering I have used DBSCAN in R.

Advantages of DBSCAN in my case:

  1. I don't have to predefine the number of clusters
  2. I can calculate a distance matrix (using Haversine Distance Formula) and use that as input in dbscan

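    library(fossil)  # earth.dist()
    library(fpc)     # dbscan()

    # df holds the lat/long values; earth.dist() returns pairwise distances in km
    d <- earth.dist(df, dist = TRUE)
    # method = "dist" tells dbscan() that d is a distance matrix, so eps is in km
    dens <- dbscan(d, MinPts = 25, eps = 0.43, method = "dist")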
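
For reference, the haversine distance mentioned above can be sketched in base R as follows. The function names `haversine_km` and `haversine_matrix`, the column names `lat` and `lon`, and the 6371 km mean Earth radius are assumptions made for illustration, not something taken from fossil.

```r
# Haversine great-circle distance in km; inputs are decimal degrees.
# `haversine_km` is a hypothetical helper name (an assumption).
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Pairwise distances for a data frame with columns `lat` and `lon` (assumed names).
haversine_matrix <- function(df) {
  n <- nrow(df)
  m <- outer(seq_len(n), seq_len(n), function(i, j) {
    haversine_km(df$lat[i], df$lon[i], df$lat[j], df$lon[j])
  })
  as.dist(m)
}
```

The resulting `dist` object is in kilometres, so any eps compared against it has to be in kilometres as well.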
    

Now, when I look at the clusters, they are not meaningful: some clusters contain points that are more than 1 km apart. I want dense clusters, but not ones that large.

I have already tried different values of MinPts and eps, and I have also used a k-nearest-neighbor distance plot to pick a good value of eps for MinPts=25.
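
A minimal sketch of how such a k-nearest-neighbor distance curve can be computed from a distance object in base R. The helper name `knn_dist` and the toy data are assumptions. Whether k should be MinPts or MinPts - 1 depends on whether the implementation counts the point itself.

```r
# k-nearest-neighbour distance for every point, from a precomputed dist object.
# `knn_dist` is a hypothetical helper name, not part of fpc or fossil.
knn_dist <- function(d, k) {
  m <- as.matrix(d)
  # After sorting a row, position 1 is the point itself (distance 0),
  # so the k-th neighbour sits at position k + 1.
  apply(m, 1, function(row) sort(row)[k + 1])
}

# Toy data standing in for the real distance matrix (an assumption).
set.seed(1)
pts <- matrix(runif(200), ncol = 2)
kd <- sort(knn_dist(dist(pts), k = 4))
```

Plotting the sorted values, for example with `plot(kd, type = "l")`, and looking for the knee of the curve gives a candidate eps.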

What dbscan does is visit every point in my dataset, and if a point p has at least MinPts points in its eps neighborhood, it starts a cluster. At the same time it also merges clusters that are density-reachable from each other, which I guess is what is causing my problem.

It really is a big question, particularly "how to reduce size of a cluster without affecting its information too much", but I will write it down as the following points:

  1. How to remove border points in a cluster? I know which points are in which cluster using dens$cluster, but how would I know if a particular point is core or border?
  2. Is cluster 0 always noise?
  3. I was under the impression that the size of a cluster would be comparable to eps. But that's not the case because density reachable clusters are combined together.
  4. Is there any other clustering method which has the advantage of dbscan but can give me more meaningful clusters?

OPTICS is another alternative but will it solve my issue?

Note: By meaningful I mean that nearby points should be in the same cluster, but points that are 1 km or more apart should not be.

sau
  • Hi Sau, have you considered using K-mean clustering? – jinlong Dec 31 '13 at 13:03
  • K-means clustering needs to know how many clusters you want to make. That's the drawback of it. Also, I cannot pass my distance matrix into k-means. – sau Dec 31 '13 at 13:10
  • This seems like more of a conceptual question about the nature of different clustering algorithms, rather than a programming how-to question; it belongs on [Cross Validated](http://stats.stackexchange.com/) (i.e., stats.SE) instead of here. – gung - Reinstate Monica Dec 31 '13 at 14:56
  • There is `R` code for half a dozen methods for estimating suitable numbers of clusters for k-means over [here](http://stackoverflow.com/a/15376462/1036500). You do not really need to 'know' how many clusters your data have. – Ben Dec 31 '13 at 15:08
  • possible duplicate of [dbscan - setting limit on maximum cluster span](http://stackoverflow.com/questions/18547147/dbscan-setting-limit-on-maximum-cluster-span) – Has QUIT--Anony-Mousse Dec 31 '13 at 18:02
  • @Ben Yeah, I have tried K-means and PAM. I have a different formula for calculating the distance between two points: ACOS(SIN(lat1)*SIN(lat2)+COS(lat1)*COS(lat2)*COS(lon2-lon1))*6371 km. It can't work with K-means. It can work with PAM if I calculate the distance matrix separately. – sau Dec 31 '13 at 21:05
  • I am working on a similar problem (given a list of latitude and longitude values, find clusters of maximum radius r, so that I can find the most popular cluster centroids). Did you ever find a suitable solution? – stackoverflowuser2010 Jun 05 '14 at 22:09

1 Answer


DBSCAN doesn't claim the radius is the maximum cluster size.

Have you read the article? DBSCAN looks for arbitrarily shaped clusters. eps is just the neighborhood radius of a point, roughly the radius used for density estimation; any point within this radius of a core point will be part of that core point's cluster.

This makes it essentially the maximum step size to connect dense points. But they may still form a chain of density-connected points, of arbitrary shape or size.
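
To make the chaining effect concrete, here is a small base-R sketch on toy data (the positions and the flood-fill labelling are illustrative assumptions, not the fpc implementation). Points sit on a line 0.3 km apart, so each step is below eps = 0.43, yet the single resulting cluster spans 3 km. It assumes MinPts is small enough (2, counting the point itself) that every point is a core point.

```r
# Points on a line, 0.3 km apart: every step is below eps, so they chain up.
x <- seq(0, 3, by = 0.3)          # toy positions in km (an assumption)
d <- as.matrix(dist(x))
eps <- 0.43

# Label connected components of the "within eps" graph by flood fill.
# When every point is core, these components are the DBSCAN clusters.
n <- length(x)
label <- integer(n)
current <- 0L
for (start in seq_len(n)) {
  if (label[start] != 0L) next
  current <- current + 1L
  label[start] <- current
  queue <- start
  while (length(queue) > 0) {
    p <- queue[1]
    queue <- queue[-1]
    nb <- which(d[p, ] <= eps & label == 0L)  # unlabelled eps-neighbours
    label[nb] <- current
    queue <- c(queue, nb)
  }
}

n_clusters <- max(label)                        # number of components found
span_km <- max(d[label == 1L, label == 1L])     # diameter of the first one
```

Here `span_km` is far larger than eps even though no single link exceeds it, which is the behaviour described above.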

I don't know what cluster 0 is in your R implementation. I've experimented with the R implementation, but it was waaaay slower than all the others. I don't recommend using R; there are much better tools for cluster analysis available, such as ELKI. Try running DBSCAN with your settings in ELKI, with LatLngDistanceFunction and a sort-tile-recursive loaded R-tree index. You'll be surprised how fast it can be, compared to R.
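
On the core versus border part of the question: a point is core if at least MinPts points lie within eps of it, and border if it is not core but lies within eps of some core point. A rough base-R sketch that works directly on the distance matrix follows. `classify_points` is a hypothetical helper, and counting the point itself towards MinPts is an assumption about the convention. As far as I recall, the fpc documentation describes an `isseed` component on the returned object and codes noise as cluster 0, but verify that against the installed version.

```r
# Classify points as core, border, or noise from a distance object.
# `classify_points` is a hypothetical helper; it counts the point itself
# towards MinPts, which is an assumption about the convention in use.
classify_points <- function(d, eps, MinPts) {
  m <- as.matrix(d)
  within <- m <= eps
  is_core <- rowSums(within) >= MinPts
  # Border: not core, but within eps of at least one core point.
  near_core <- rowSums(within[, is_core, drop = FALSE]) > 0
  ifelse(is_core, "core", ifelse(near_core, "border", "noise"))
}

x <- c(0, 0.1, 0.2, 0.5, 5)       # toy 1-D positions in km (an assumption)
kind <- classify_points(dist(x), eps = 0.35, MinPts = 3)
```

Keeping only the rows where `kind == "core"` is one way to drop border points from each cluster.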

OPTICS is looking for the same density connected type of clusters. Are you sure this arbitrarily-shaped type of clusters is what you are looking for?

IMHO, you are using the wrong method for your goals (and you aren't really explaining what you are trying to achieve).

If you want a hard limit on the cluster diameter, use complete-linkage hierarchical clustering.
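
A minimal sketch of that idea with base R's `hclust` and `cutree`, on toy data (the positions and the 1 km limit are assumptions for illustration). With complete linkage, the merge height is the largest pairwise distance in the merged cluster, so cutting the tree at height h yields clusters whose diameter is at most h.

```r
# Complete-linkage clustering cut at a maximum cluster diameter.
# Toy 1-D positions in km stand in for the real distance matrix (an assumption).
x <- c(0, 0.2, 0.4, 0.9, 1.3, 5.0, 5.1)
d <- dist(x)

hc <- hclust(d, method = "complete")
max_diameter_km <- 1
cl <- cutree(hc, h = max_diameter_km)   # cluster label per point

# Largest pairwise distance inside each cluster, to check the limit holds.
m <- as.matrix(d)
diameters <- sapply(split(seq_along(cl), cl), function(idx) max(m[idx, idx]))
```

With real GPS data, `d` would be the great-circle distance object in km. Note that `hclust` needs the full distance matrix in memory, which limits how many points this scales to.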

Has QUIT--Anony-Mousse
  • Thanks!! Yeah, it could be the wrong method, but so far it seems to be the best one. K-means and PAM were my first and second attempts. I made a distance matrix using this formula: ACOS(SIN(lat1)*SIN(lat2)+COS(lat1)*COS(lat2)*COS(lon2-lon1))*6371. As it's GPS data, I thought dbscan could be useful for finding important places (the data set also has a datetime stamp). Now I know the important places, but some of them are really big. DBSCAN is working as it should. I just need to see what other information I can get from the clusters. – sau Dec 31 '13 at 21:18
  • Cluster 0 is the cluster with all points on the border. Also, any comment about "how to remove border points"? – sau Dec 31 '13 at 21:19
  • IIRC the latest ELKI has a flag to return core points as "sub clusters" of the border points (they are more central). This is fairly trivial to implement; it's mostly a matter of memory usage. _Not_ keeping track of border points is more efficient, so most implementations don't bother to track border points separately. – Has QUIT--Anony-Mousse Jan 01 '14 at 03:17
  • For finding important places, you actually don't need cluster analysis. You might be using the wrong screwdriver for your _nail_. Instead, consider looking for local maxima in a **density estimation** (but this will still mask important places that are near even more important places!). Please **define "important places" first**. Unless you are *precise* about what you are looking for, you probably won't find an algorithm to find it for you either. – Has QUIT--Anony-Mousse Jan 01 '14 at 03:19
  • If you want to have a hard limit on the cluster diameter, you may be looking for **complete-linkage clustering**. Unfortunately, most implementations are in `O(n^3)`, i.e. really slow. – Has QUIT--Anony-Mousse Jan 01 '14 at 15:05
  • Mean shift clustering looks more appropriate, but not much material is available on it. R has the tools to do mean shift clustering. – sau Jan 02 '14 at 06:39
  • Sau, the formula you wrote in the first comment calculates distance in kilometers. Make sure that you specify the **epsilon** parameter in this unit too, not in meters or feet! When dealing with GPS data, it's a common mistake to forget to keep measurement units in sync :) – sw0rdf1sh Jul 06 '14 at 11:33