Questions tagged [hierarchical-clustering]

Hierarchical clustering is a clustering technique that generates clusters at multiple hierarchical levels, thereby producing a tree of clusters. Its tree-structured output gives analysts considerable visualization potential.

Examples

Common methods include DIANA (DIvisive ANAlysis), which performs top-down clustering: it starts with the entire data set as a single cluster and repeatedly splits clusters until each data point sits in its own cluster, or until a user-defined stopping condition is reached.

Another widely known method is AGNES (AGglomerative NESting), which works in the opposite, bottom-up direction: each data point starts as its own cluster and the closest clusters are merged step by step until only one cluster remains (or a stopping condition is met).
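As an illustration of this agglomerative flow, here is a minimal SciPy sketch (the random data, the 'average' linkage and the choice of three clusters are arbitrary assumptions for the example):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy data: 20 points in 2-D, purely illustrative.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))

    # Agglomerative (bottom-up) clustering: each point starts as its own
    # cluster and the closest pair of clusters is merged at every step.
    Z = linkage(X, method="average")   # Z encodes the full merge tree

    # Cut the tree into, say, three flat clusters.
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(labels)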

Distance metrics & some advantages

There are a multitude of ways to compute the distance between clusters (the linkage criterion) that these techniques use when deciding how to split or merge; for example, complete linkage and single linkage use the maximum and the minimum pairwise distance between two clusters, respectively.
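For instance, a small hedged sketch (the four toy points are made up) showing how the linkage criterion changes the merge heights recorded in the tree:

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage

    # Two tight pairs of points, far apart from each other.
    X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.2]])
    D = pdist(X)  # condensed vector of pairwise distances

    # Single linkage merges on the *minimum* pairwise distance between
    # clusters, complete linkage on the *maximum*; column 2 of the
    # linkage matrix holds the distance at which each merge happened.
    print(linkage(D, method="single")[:, 2])
    print(linkage(D, method="complete")[:, 2])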

Hierarchical clustering offers analysts strong visualization potential, since its output is a hierarchical classification of the dataset, typically drawn as a dendrogram. Such trees (hierarchies) can be used in a myriad of ways.
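A minimal sketch of that visualization with SciPy and matplotlib (the random data and Ward linkage are just assumptions for the example):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(1)
    X = rng.normal(size=(15, 2))
    Z = linkage(X, method="ward")

    # The dendrogram shows every merge and its height, so a clustering at
    # any level of granularity can be read off the tree.
    dendrogram(Z)
    plt.xlabel("observation index")
    plt.ylabel("merge distance")
    plt.show()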

Other non-hierarchical clustering techniques

Other clustering methodologies include, but are not limited to, partitioning techniques (such as k-means and PAM) and density-based techniques (such as DBSCAN), the latter being known for discovering clusters of unusual, non-convex shapes.
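For contrast, a minimal scikit-learn sketch of the two non-hierarchical families mentioned above (the data and all parameter values are arbitrary assumptions):

    import numpy as np
    from sklearn.cluster import KMeans, DBSCAN

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 2))

    # Partitioning: k-means needs the number of clusters up front.
    km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Density-based: DBSCAN infers the number of clusters itself and can
    # recover non-convex shapes; points labelled -1 are treated as noise.
    db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    print(km_labels, db_labels)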

Suggested learning sources to look into

  • Han, Kamber and Pei's Data Mining book, whose lecture slides and companion material can be found here.
  • Wikipedia has an entry on the topic here.
1079 questions
43 votes, 2 answers

Use Distance Matrix in scipy.cluster.hierarchy.linkage()?

I have a distance matrix n*n M where M_ij is the distance between object_i and object_j. So as expected, it takes the following form: / 0 M_01 M_02 ... M_0n\ | M_10 0 M_12 ... M_1n | | M_20 M_21 0 ... …
Sibbs Gambling
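For the question above, answers usually point out that scipy.cluster.hierarchy.linkage expects the condensed form of a distance matrix, not the square n*n form; a hedged sketch (the random matrix is only a stand-in for M):

    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage

    # Build a symmetric n*n distance matrix with zeros on the diagonal
    # (a random one here, standing in for the question's M).
    rng = np.random.default_rng(3)
    A = rng.random((5, 5))
    M = (A + A.T) / 2
    np.fill_diagonal(M, 0.0)

    # linkage() wants the *condensed* form (the upper triangle flattened
    # into a 1-D vector); squareform() converts between the two.
    Z = linkage(squareform(M), method="average")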
37 votes, 1 answer

Tutorial for scipy.cluster.hierarchy

I'm trying to understand how to manipulate a hierarchy cluster but the documentation is too ... technical?... and I can't understand how it works. Is there any tutorial that can help me to start with, explaining step by step some simple…
user2988577
35 votes, 4 answers

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clustering) work?, informed me that Levenshtein distance is good to be used as a…
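A common pattern for problems like this one (a sketch only; the word list and the plain-Python Levenshtein function are illustrative, not taken from the question) is to precompute the pairwise distances and feed them to a hierarchical linkage:

    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage, fcluster

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    words = ["cat", "cats", "bat", "dog", "dogs", "dig"]
    n = len(words)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = levenshtein(words[i], words[j])

    # Condense the symmetric matrix and build the cluster tree.
    Z = linkage(squareform(D), method="average")
    print(fcluster(Z, t=2, criterion="distance"))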
29 votes, 1 answer

differences in heatmap/clustering defaults in R (heatplot versus heatmap.2)?

I'm comparing two ways of creating heatmaps with dendrograms in R, one with made4's heatplot and one with gplots of heatmap.2. The appropriate results depend on the analysis but I'm trying to understand why the defaults are so different, and how to…
user248237
28 votes, 2 answers

Extracting clusters from seaborn clustermap

I am using the seaborn clustermap to create clusters and visually it works great (this example produces very similar results). However I am having trouble figuring out how to programmatically extract the clusters. For instance, in the example link,…
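One approach seen for this kind of question (a hedged sketch, assuming the ClusterGrid returned by clustermap exposes dendrogram_row.linkage, as recent seaborn versions do) is to cut the linkage matrix that the clustermap already computed:

    import numpy as np
    import seaborn as sns
    from scipy.cluster.hierarchy import fcluster

    data = np.random.rand(30, 8)            # toy data, rows = samples
    g = sns.clustermap(data)

    # The ClusterGrid keeps the row/column linkage matrices it used, so
    # the same tree can be cut into flat cluster labels.
    row_linkage = g.dendrogram_row.linkage
    row_labels = fcluster(row_linkage, t=4, criterion="maxclust")
    print(row_labels)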
24 votes, 2 answers

Hierarchical clustering of 1 million objects

Can anyone point me to a hierarchical clustering tool (preferable in python) that can cluster ~1 Million objects? I have tried hcluster and also Orange. hcluster had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but…
24 votes, 1 answer

How to give sns.clustermap a precomputed distance matrix?

Usually when I do dendrograms and heatmaps, I use a distance matrix and do a bunch of SciPy stuff. I want to try out Seaborn but Seaborn wants my data in rectangular form (rows=samples, cols=attributes, not a distance matrix)? I essentially want…
O.rka
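One workaround (a sketch under the assumption that clustermap's row_linkage/col_linkage arguments are acceptable here) is to compute the linkages yourself from any distance you like and let seaborn only do the drawing:

    import numpy as np
    import seaborn as sns
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage

    data = np.random.rand(20, 6)   # rows = samples, cols = attributes

    # Build the row/column linkages from a precomputed distance of your choice...
    row_linkage = linkage(pdist(data, metric="cityblock"), method="average")
    col_linkage = linkage(pdist(data.T, metric="cityblock"), method="average")

    # ...and hand them to clustermap so it only handles the rendering.
    sns.clustermap(data, row_linkage=row_linkage, col_linkage=col_linkage)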
24 votes, 3 answers

Clustering words based on Distance Matrix

My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed Jaccard Similarity between every pair of words. In other words, I have a sparse distance matrix available with me. Can anyone…
23 votes, 1 answer

how do I get the subtrees of dendrogram made by scipy.cluster.hierarchy

I had a confusion regarding this module (scipy.cluster.hierarchy) ... and still have some ! For example we have the following dendrogram: My question is how can I extract the coloured subtrees (each one represent a cluster) in a nice format, say…
titan
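One way to walk subtrees programmatically (a sketch; whether it matches the question's colour-based notion of a subtree is an assumption) is to convert the linkage matrix into a tree of ClusterNode objects:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, to_tree

    rng = np.random.default_rng(4)
    X = rng.normal(size=(12, 2))
    Z = linkage(X, method="average")

    # to_tree() turns the linkage matrix into a binary tree of ClusterNode
    # objects; pre_order() lists the observation indices under a subtree.
    root = to_tree(Z)
    left, right = root.get_left(), root.get_right()
    print(left.pre_order(), right.pre_order())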
22 votes, 5 answers

Distributed hierarchical clustering

Are there any algorithms that can help with hierarchical clustering? Google's map-reduce has only an example of k-clustering. In case of hierarchical clustering, I'm not sure how it's possible to divide the work between nodes. Other resource that I…
Roman
22 votes, 3 answers

How to specify a distance function for clustering?

I'd like to cluster points given to a custom distance and strangely, it seems that neither scipy nor sklearn clustering methods allow the specification of a distance function. For instance, in sklearn.cluster.AgglomerativeClustering, the only thing…
Mark Morrisson
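A common route (a sketch; the weighted Manhattan distance is just an illustrative stand-in for a real custom metric) is to compute the condensed distances with pdist and a callable, then cluster the result hierarchically:

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    def my_distance(u, v):
        # Any symmetric, non-negative function of two 1-D vectors works;
        # this weighted Manhattan distance is only an example.
        return float(np.sum(np.abs(u - v) * np.array([1.0, 2.0])))

    X = np.random.rand(10, 2)
    Z = linkage(pdist(X, metric=my_distance), method="average")
    print(fcluster(Z, t=3, criterion="maxclust"))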
20 votes, 1 answer

How to adjust branch lengths of dendrogram in matplotlib (like in astrodendro)? [Python]

Here is my resulting plot below but I would like it to look like the truncated dendrograms in astrodendro such as this: There is also a really cool looking dendrogram from this paper that I would like to recreate in matplotlib. Below is the code…
O.rka
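SciPy's own truncation options cover part of this (a hedged sketch; fully replicating the astrodendro look would still need custom matplotlib work):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(6)
    Z = linkage(rng.normal(size=(50, 4)), method="ward")

    # truncate_mode="lastp" collapses everything below the last p merges,
    # which shortens and declutters the crowded lower branches.
    dendrogram(Z, truncate_mode="lastp", p=12, show_contracted=True)
    plt.show()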
20 votes, 3 answers

Tag hierarchies and handling of

This is a real issue that applies on tagging items in general (and yes, this applies to StackOverflow too, and no, it is not a question about StackOverflow). The whole tagging issue helps cluster similar items, whatever items they may be (jokes,…
tzot
19 votes, 4 answers

How to get flat clustering corresponding to color clusters in the dendrogram created by scipy

Using the code posted here, I created a nice hierarchical clustering: Let's say the dendrogram on the left was created by doing something like Y = sch.linkage(D, method='average') # D is a distance matrix cutoff = 0.5*max(Y[:,2]) Z =…
conradlee
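The usual pattern for this (a sketch; random observations stand in for the question's distance matrix D, and the 0.5 cutoff is taken from the excerpt above, not verified against the accepted answer) is to pass the colouring threshold to fcluster with criterion='distance':

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(5)
    X = rng.normal(size=(15, 3))
    Y = linkage(X, method="average")

    cutoff = 0.5 * max(Y[:, 2])
    dendrogram(Y, color_threshold=cutoff)                  # colours branches below the cutoff
    labels = fcluster(Y, t=cutoff, criterion="distance")   # the same cut, as flat labels
    print(labels)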
17 votes, 2 answers

Implementing an efficient graph data structure for maintaining cluster distances in the Rank-Order Clustering algorithm

I'm trying to implement the Rank-Order Clustering here is a link to the paper (which is a kind of agglomerative clustering) algorithm from scratch. I have read through the paper (many times) and I have an implementation that is working although it…
YellowPillow