
My scenario is pretty straightforward: I have a bunch of news articles (~1k at the moment), and I know that some of them cover the same story/topic. I would now like to group these articles based on shared story/topic, i.e., based on their similarity.

What I have done so far is apply basic NLP techniques, including stopword removal and stemming. I have also calculated the tf-idf vector for each article, and from these vectors I can compute, e.g., the cosine similarity between articles. But with the actual grouping of the articles I struggle a bit. I see two principal -- and probably related -- ways to do it:
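
For concreteness, a minimal sketch of that pipeline with scikit-learn (`articles` stands for the list of preprocessed article strings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# articles: assumed to be a list of stemmed, stopword-filtered strings
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(articles)   # sparse matrix, shape (n_articles, n_terms)
sim = cosine_similarity(tfidf)               # dense matrix, shape (n_articles, n_articles)
```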

1) Machine Learning / Clustering: I have already played a bit with existing clustering libraries, with more or less success; see here. On the one hand, algorithms such as k-means require the number of clusters as input, which I don't know. Other algorithms require parameters that are also not intuitive to specify (for me, at least).
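
(As a sketch of this direction: a density-based method such as DBSCAN does not need k, but its `eps` radius is exactly such a non-obvious parameter; the value below is just a guess. This is also the approach taken in the StackOverflow question linked in the comments.)

```python
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

# reusing the tf-idf matrix from the sketch above; eps is in cosine-distance
# units (1 - similarity) and its value here is an arbitrary assumption
dist = cosine_distances(tfidf)
labels = DBSCAN(eps=0.4, min_samples=2, metric="precomputed").fit_predict(dist)
# label -1 marks articles assigned to no group ("noise")
```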

2) Graph algorithms: I can represent my data as a graph, with the articles as nodes and weighted edges representing the pairwise (cosine) similarity between articles. With that, I could, for example, first remove all edges that fall below a certain threshold and then apply graph algorithms to look for densely connected subgraphs.
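
A sketch of this graph route with networkx, reusing the `sim` matrix from above (the 0.5 threshold is an arbitrary assumption; since the similarity graph is undirected, connected components stand in for the subgraphs):

```python
import networkx as nx

G = nx.Graph()
n = sim.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        if sim[i, j] >= 0.5:                   # assumed threshold; needs tuning
            G.add_edge(i, j, weight=sim[i, j])

# every connected component is one candidate story/topic group
groups = list(nx.connected_components(G))
```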

In short, I'm not sure where best to go from here -- I'm still pretty new to this area. I wonder whether there are best practices for this, or some kind of guidelines as to which methods/algorithms can (or cannot) be applied in certain scenarios.

(EDIT: I forgot to link to a related question of mine.)

Christian
    I don't think there's a single "best way" to perform this task; in fact, from the way you posed your question, you could potentially apply dozens of different algorithms and obtain qualitatively similar results. Did you read [any paper on text categorization](http://scholar.google.it/scholar?q=text+categorization)? – The Coding Monk Aug 10 '14 at 12:50
  • Yeah, I have started reading about similarity measures, (unsupervised) learning/clustering, and related topics. But in parallel I would also like to just try things out -- you know, "learning by doing" or, if it comes to that, "learning by burning/failing". I don't expect THE one best way to do it. However, it seems like a rather common task, and I was hoping for some best-practice approaches. – Christian Aug 11 '14 at 03:19
  • I would really like to know how you clustered your dataset, because I have a set of resumes that I want to cluster and categorize, and I'm pretty new to the field. How can I contact you? Sorry for not answering here, but you are my last hope *_* Sorry again. – Abeer zaroor Feb 23 '16 at 11:57
  • @Abeerzaroor -- See this [StackOverflow question](http://stackoverflow.com/questions/25217065/scikit-learn-clustering-text-documents-using-dbscan) of mine with a minimal, now working example. It's essentially just a simplified version of the more elaborate example on the [SciKit-Learn website](http://scikit-learn.org/stable/auto_examples/text/document_clustering.html). Both links should help you to start rolling. – Christian Feb 24 '16 at 01:44

3 Answers


Try the class of Hierarchical Agglomerative Clustering (HAC) algorithms with single and complete linkage.

These algorithms do not need the number of clusters as input.

The basic principle is similar to growing a minimal spanning tree across a given set of data points and then stopping based on a threshold criterion. A closely related class is the divisive clustering algorithms, which first build up the minimal spanning tree and then prune off branches based on inter-cluster similarity ratios.
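
A minimal sketch of HAC with SciPy, assuming the tf-idf matrix from the question; cutting the dendrogram at a distance threshold (the 0.7 below is an arbitrary assumption) replaces the need to fix the number of clusters up front:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import cosine_distances

# tfidf: the tf-idf document matrix from the question's pipeline
dist = cosine_distances(tfidf)
Z = linkage(squareform(dist, checks=False), method="complete")  # or method="single"
labels = fcluster(Z, t=0.7, criterion="distance")  # cut the tree at a distance threshold
```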

Debasis
  • Debasis, I will look into that! Are there any off-the-shelf tools or libraries I can use to get some first ideas? – Christian Aug 11 '14 at 03:22
  • You can use Weka. It has both the single-link and the complete-link implementations of HAC: http://weka.sourceforge.net/doc.dev/weka/clusterers/HierarchicalClusterer.html – Debasis Aug 11 '14 at 09:26

You can also try a canopy variation on k-means to create a relatively quick estimate for the number of clusters (k).

http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
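
A rough, self-contained sketch of the canopy pass for estimating k on the question's tf-idf matrix (the threshold value is an assumption to tune):

```python
from sklearn.metrics.pairwise import cosine_distances

def estimate_k_by_canopies(X, t2=0.3):
    """One canopy pass over the data: each new canopy center removes all
    points within the tight distance threshold t2 from the candidate pool,
    and the number of canopies serves as a quick estimate of k.
    (The full algorithm also uses a loose threshold t1 > t2 to define
    canopy membership for seeding k-means; it is omitted here.)"""
    dist = cosine_distances(X)
    remaining = list(range(X.shape[0]))
    k = 0
    while remaining:
        center = remaining.pop(0)            # arbitrary pick of the next center
        k += 1
        remaining = [j for j in remaining if dist[center, j] >= t2]
    return k

# e.g. k = estimate_k_by_canopies(tfidf), then feed k into k-means
```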

Will you be recomputing over time or do you only care about a static set of news? I ask because your k may change a bit over time.

  • For the time being I'm just trying to play around with off-the-shelf algorithms/libraries to get some idea of how stuff works and what results to expect. But you're right, in the long run the set of documents will be dynamic. I have already started reading about incremental versions of clustering algorithms. – Christian Aug 11 '14 at 23:24

Since you can model your dataset as a graph, you could apply stochastic clustering based on Markov models. Here are links to resources on the MCL algorithm:

  • Official thesis description and code base
  • Gephi plugin for MCL (to experiment with and evaluate the method)
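
For a quick feel of what MCL does before installing anything, a toy numpy sketch of its expansion/inflation loop (parameter values are the common defaults; the official implementation linked above is the one to use in practice):

```python
import numpy as np

def mcl(adj, expansion=2, inflation=2.0, iters=50, tol=1e-6):
    """Toy Markov Cluster (MCL) pass over a dense adjacency matrix,
    e.g. the thresholded similarity matrix from the question."""
    M = adj + np.eye(adj.shape[0])                # add self-loops
    M = M / M.sum(axis=0)                         # column-normalize -> stochastic matrix
    for _ in range(iters):
        last = M.copy()
        M = np.linalg.matrix_power(M, expansion)  # expansion: flow spreads out
        M = M ** inflation                        # inflation: strong flows win
        M = M / M.sum(axis=0)
        if np.allclose(M, last, atol=tol):
            break
    # rows with a nonzero diagonal belong to "attractors"; each such row lists
    # one cluster (the set() dedupes clusters that have several attractors)
    clusters = [frozenset(np.nonzero(row)[0]) for row, d in zip(M, M.diagonal()) if d > tol]
    return list(set(clusters))
```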

AxxE
  • This looks pretty interesting, thanks! I installed the Gephi plugin and applied it to some sample data. – Christian Aug 14 '14 at 01:10
  • Christian, I'd be interested in knowing more about your experiment and results with graph-based approaches and otherwise. – AxxE Aug 17 '14 at 17:20