
I have a list of about 100 igraph objects with a typical object having about 700 vertices and 3500 edges.

I would like to identify groups of vertices within which ties are more likely. My plan is to then use a mixed model to predict how many within-group ties vertices have using vertex and group attributes.

Some people may want to respond to other aspects of my project, which would be great, but the thing I'm most interested in is information about functions in igraph for grouping vertices. I've come across these community detection algorithms but I'm not sure of their advantages and disadvantages, or whether some other function would be better for my case. I saw the links here as well, but they aren't specific to igraph. Thanks for your advice.

Michael Bishop

2 Answers


Here is a short summary about the community detection algorithms currently implemented in igraph:

  • edge.betweenness.community is a hierarchical decomposition process where edges are removed in decreasing order of their edge betweenness scores (i.e. the number of shortest paths that pass through a given edge). This is motivated by the fact that edges connecting different groups are more likely to lie on multiple shortest paths, simply because in many cases they are the only way to get from one group to another. This method yields good results but is very slow, both because edge betweenness is computationally expensive and because the betweenness scores have to be recalculated after every edge removal. Your graphs with ~700 vertices and ~3500 edges are around the upper size limit of what is feasible to analyze with this approach. Another disadvantage is that edge.betweenness.community builds a full dendrogram and does not give you any guidance about where to cut it to obtain the final groups, so you'll have to use some other measure to decide that (e.g., the modularity score of the partitions at each level of the dendrogram).

  • fastgreedy.community is another hierarchical approach, but it is bottom-up instead of top-down. It tries to optimize a quality function called modularity in a greedy manner. Initially, every vertex belongs to a separate community, and communities are merged iteratively such that each merge is locally optimal (i.e. yields the largest increase in the current value of modularity). The algorithm stops when it is not possible to increase the modularity any more, so it gives you a grouping as well as a dendrogram. The method is fast and it is the method that is usually tried as a first approximation because it has no parameters to tune. However, it is known to suffer from a resolution limit, i.e. communities below a given size threshold (depending on the number of nodes and edges if I remember correctly) will always be merged with neighboring communities.

  • walktrap.community is an approach based on random walks. The general idea is that if you perform random walks on the graph, then the walks are more likely to stay within the same community because there are only a few edges that lead outside a given community. Walktrap runs short random walks of 3, 4 or 5 steps (depending on one of its parameters) and uses the results of these random walks to merge separate communities in a bottom-up manner like fastgreedy.community. Again, you can use the modularity score to select where to cut the dendrogram. It is a bit slower than the fast greedy approach but also a bit more accurate (according to the original publication).

  • spinglass.community is an approach from statistical physics, based on the so-called Potts model. In this model, each particle (i.e. vertex) can be in one of c spin states, and the interactions between the particles (i.e. the edges of the graph) specify which pairs of vertices would prefer to stay in the same spin state and which ones prefer to have different spin states. The model is then simulated for a given number of steps, and the spin states of the particles in the end define the communities. The consequences are as follows: 1) There will never be more than c communities in the end, although you can set c to as high as 200, which is likely to be enough for your purposes. 2) There may be fewer than c communities in the end as some of the spin states may become empty. 3) It is not guaranteed that nodes in completely remote (or disconnected) parts of the network have different spin states. This is more likely to be a problem for disconnected graphs only, so I would not worry about that. The method is not particularly fast and not deterministic (because of the simulation itself), but has a tunable resolution parameter that determines the cluster sizes. A variant of the spinglass method can also take into account negative links (i.e. links whose endpoints prefer to be in different communities).

  • leading.eigenvector.community is a top-down hierarchical approach that optimizes the modularity function again. In each step, the graph is split into two parts in a way that the separation itself yields a significant increase in the modularity. The split is determined by evaluating the leading eigenvector of the so-called modularity matrix, and there is also a stopping condition which prevents tightly connected groups from being split further. Due to the eigenvector calculations involved, it might not work on degenerate graphs where the ARPACK eigenvector solver is unstable. On non-degenerate graphs, it is likely to yield a higher modularity score than the fast greedy method, although it is a bit slower.

  • label.propagation.community is a simple approach in which every node is assigned one of k labels. The method then proceeds iteratively and re-assigns labels to nodes in a way that each node takes the most frequent label of its neighbors in a synchronous manner. The method stops when the label of each node is one of the most frequent labels in its neighborhood. It is very fast but yields different results based on the initial configuration (which is decided randomly), therefore one should run the method a large number of times (say, 1000 times for a graph) and then build a consensus labeling, which could be tedious.
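
To make the label propagation idea concrete, here is a minimal pure-Python sketch (it is not igraph's implementation). The function name `label_propagation`, the adjacency-dict input and the `max_sweeps` cap are assumptions made for illustration. The sketch also assumes that every node starts with its own unique label and that nodes are updated one at a time in random order, which differs from the synchronous description above. The stopping rule is the one described above: stop once every node's label is among the most frequent labels of its neighbors.

```python
import random
from collections import Counter


def label_propagation(adj, seed=None, max_sweeps=100):
    """Sketch of label propagation on an undirected graph.

    adj maps each node to a list of its neighbours (hypothetical input
    format chosen for this example). Returns a dict node -> label.
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}  # assumption: one unique label per node
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # visit nodes in a fresh random order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            # Keep the current label if it is already one of the most
            # frequent ones; otherwise pick among the ties at random.
            if labels[v] not in candidates:
                labels[v] = rng.choice(candidates)
                changed = True
        if not changed:
            break  # every label is among the most frequent neighbour labels
    return labels


# Two 4-cliques joined by a single bridge edge (3-4).
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
for seed in range(3):
    print(seed, label_propagation(adj, seed=seed))
```

Different seeds can give different groupings, which is why running the method many times and building a consensus is suggested above.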

igraph 0.6 will also include the state-of-the-art Infomap community detection algorithm, which is based on information theoretic principles; it tries to build a grouping which provides the shortest description length for a random walk on the graph, where the description length is measured by the expected number of bits per vertex required to encode the path of a random walk.

Anyway, I would probably go with fastgreedy.community or walktrap.community as a first approximation and then evaluate other methods if it turns out that these two are not suitable for a particular problem for some reason.
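
Since modularity is the score suggested above for choosing where to cut a dendrogram, and it is also a simple way to compare the groupings produced by different methods, here is a small pure-Python sketch of the standard modularity formula for an unweighted, undirected graph. The function name `modularity` and the edge-list plus membership-dict input are assumptions made for illustration; igraph has its own modularity routine, which this does not call.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping each vertex to its community ID.
    Q = sum over communities c of  e_c / m - (d_c / (2 m))^2,
    where m is the number of edges, e_c the number of edges inside c,
    and d_c the total degree of the vertices in c.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")
    internal = defaultdict(int)  # e_c
    degree = defaultdict(int)    # d_c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a bridge edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
lumped = {v: "a" for v in range(6)}
print(modularity(edges, split), modularity(edges, lumped))
```

The partition with the higher score is the one the modularity criterion prefers; for a dendrogram, the same function could be evaluated at every cut level.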

Tamás
  • What do you mean by leading eigenvector? – user Jun 16 '14 at 17:32
  • The leading eigenvector is the eigenvector corresponding to the eigenvalue with the largest absolute value (in this context). – Tamás Jun 16 '14 at 20:19
  • Is the definition the same for PageRank? Wikipedia also lists the dominant eigenvector for PageRank. – user Jun 16 '14 at 23:59
  • My question is actually: can I use power iteration to find the leading eigenvector in Newman's modularity method? – user Jun 17 '14 at 19:31
  • Theoretically, yes, but note that the modularity matrix is not sparse, so a naive implementation of the power iteration may be slow. However, the modularity matrix can be decomposed into the sum of a sparse matrix and a not-sparse-but-structured matrix so it is still possible to perform the multiplication of the power iteration in time proportional to the nonzero elements of the sparse part of the modularity matrix. – Tamás Jun 17 '14 at 20:56
  • @Tamás the walktrap and spinglass algorithms seem like they are not deterministic. Would you recommend, like for the label propagation algorithm, running them many times and getting a majority vote? Thanks in advance for your clarification. – Antoine Mar 28 '15 at 16:45
  • As far as I know this is quite a common thing to do with non-deterministic algorithms. However, you should be careful because community i in one run of the algorithm may not necessarily match community i in another run since the community IDs have no semantic meaning. – Tamás Mar 28 '15 at 20:45
  • @Tamás Thanks. I have noticed that running multiple spinglass would yield slightly different partitions every time (hence, the benefit of majority vote). However, walktrap would always return the exact same communities (?). Agreed for the IDs: that's why I tried to develop a consensus building function based on associations rather than membership IDs. see accepted answer here: http://stackoverflow.com/questions/29301156/get-consensus-of-multiple-partitioning-methods-in-r. However this method only truly works when all algos return the same number of communities, which is not always the case... – Antoine Apr 02 '15 at 22:06
  • There's a new one: `multilevel.community`. Mind adding it to your list? (The existing answer is excellent btw. Thank you!) – Zach Jun 09 '15 at 17:07
  • @Tamás could you please explain the difference between the `multilevel.community` and the `fastgreedy.community` algos? It seems that these two are very similar. Thanks a lot in advance. – Antoine Jul 14 '15 at 14:43
  • From reading the paper for the `multilevel.community`, it seems that it is basically the same as the `fastgreedy.community`, but slightly modified (improved) so that it scales-up better to networks with more than 10^6 nodes. Are there any other differences? – Antoine Jul 14 '15 at 14:52
  • They are not the same; `fastgreedy.community` merges pairs of communities iteratively, always choosing the pair that yields the maximum increase in the *overall* modularity. In `multilevel.community`, communities are not merged; instead of that, nodes are moved between communities such that each node makes a *local* decision that maximizes its *own* contribution to the modularity score. When this procedure gets stuck (i.e. none of the nodes change their membership), then all the communities are collapsed into single nodes and the process continues (that's why it is multilevel). – Tamás Jul 14 '15 at 23:44
  • Thanks @Tamás. Would you mind having a look at this [thread](http://stackoverflow.com/questions/31432176/potential-issue-with-new-igraph-layout-algorithms-r) please? Thanks in advance. – Antoine Jul 15 '15 at 13:46
  • @Tamás I know that you are not actively involved in igraph anymore, but do you know by chance if some community detection algorithms now take into account edge direction? Thanks in advance – Antoine Mar 25 '16 at 13:44
  • @Antoine: the InfoMap algorithm has a nice and scientifically sound approach to handling directed edges. The implementation in igraph is not the most efficient (due to licensing issues), but it should work for smaller graphs. If it turns out to be too slow, you can try the code of the authors of the algorithm from http://mapequation.org – Tamás Mar 25 '16 at 22:45
  • @Tamás How does Louvain in igraph handle outliers? "Phenograph" has a param "min_cluster_size"; e.g. if I have a cluster with 5 points, Louvain merges it into another cluster, but I want to keep it as a separate cluster. Note: I am using scanpy, https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html – Khalid Usman Apr 19 '19 at 03:42
  • @Tamas: great answer! Are any of the algorithms you described able to make use of node attributes and edge attributes? – stats555 Nov 17 '20 at 01:08
  • Can someone take a look at these questions please? https://stackoverflow.com/questions/64849921/r-k-means-clustering-vs-community-detection-algorithms-weighted-correlation-ne https://stackoverflow.com/questions/64864298/r-are-node-attributes-and-edge-attributes-used-during-network-graph-cluster – stats555 Nov 17 '20 at 01:10
  • Nice discussion, and thank you so much! I have one question: how do I calculate the NMI value if I have the `VertexDendrogram` object returned by `Graph.community_fastgreedy(g)`? I have the ground-truth community of each vertex. – sonictl May 20 '21 at 04:01
  • First you need to cut the dendrogram somewhere to obtain a "flat" clustering (see the `as_clustering()` method of the dendrogram), and then use the `compare_communities()` function. – Tamás May 20 '21 at 11:50
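
For readers who want to see what the comparison step in the last comment computes, here is a minimal pure-Python sketch of normalized mutual information between two flat clusterings. It does not call igraph's `compare_communities()`; the function name `nmi` and the normalization 2·I(A;B) / (H(A) + H(B)) are assumptions (that is one common convention, and other normalizations exist). Because only co-occurrence counts are used, the score does not depend on the community IDs themselves, which addresses the earlier point that IDs have no semantic meaning across runs.

```python
from collections import Counter
from math import log


def nmi(a, b):
    """Normalized mutual information of two membership lists.

    a and b give the community ID of each vertex, in the same vertex
    order. Uses 2 * I(A;B) / (H(A) + H(B)) (assumed normalization).
    """
    if len(a) != len(b):
        raise ValueError("membership lists must have the same length")
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    count_ab = Counter(zip(a, b))  # joint counts of (ID in a, ID in b)
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    mi = sum(
        c / n * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )
    denom = h_a + h_b
    # Both clusterings put everything in one group: treat as identical.
    return 1.0 if denom == 0 else 2 * mi / denom


truth = [0, 0, 0, 1, 1, 1]
found = [5, 5, 5, 9, 9, 2]  # different IDs, one vertex split off
print(nmi(truth, found))
```

With a dendrogram-producing method, the `found` list would be the membership of the flat clustering obtained after cutting the dendrogram.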

A summary of the different community detection algorithms can be found here: http://www.r-bloggers.com/summary-of-community-detection-algorithms-in-igraph-0-6/

Notably, the InfoMAP algorithm is a recent newcomer that could be useful (it supports directed graphs too).

timothyjgraham
  • from the [documentation](http://www.inside-r.org/packages/cran/igraph/docs/infomap.community) it does not seem that edge direction is taken into account. Right? – Antoine Mar 18 '16 at 09:01