
I am working on a graph with 875713 nodes and 5105039 edges. Using vector<bitset<875713>> vec(875713) or array<bitset<875713>, 875713> throws a segfault at me. I need to calculate all-pair-shortest-paths with path recovery. What alternative data structures do I have?

I found this SO Thread but it doesn't answer my query.

EDIT

I tried this after reading the suggestions, seems to work. Thanks everyone for helping me out.

vector<vector<uint>> neighboursOf; // An edge between i and j exists if
                                   // neighboursOf[i] contains j
neighboursOf.resize(nodeCount);

while (getline(input, line))
{
    // Skip comments in the input file
    if (line.size() > 0 && line[0] == '#')
        continue;

    // Each line is of the format "<fromNodeId> [TAB] <toNodeId>"
    uint fromNodeId = 0;
    uint toNodeId = 0;
    sscanf(line.c_str(), "%u\t%u", &fromNodeId, &toNodeId);

    // Store the edge
    neighboursOf[fromNodeId].push_back(toNodeId);
}
Hindol
  • Try `vector vec(875713)`, or even `vector`. – Kerrek SB Jul 15 '12 at 12:15
  • If your graph is dense, then an adjacency matrix is a pretty compact representation. The Wikipedia article on graphs gives good reasons for choosing a representation – fork0 Jul 15 '12 at 12:20
  • You seem to use the adjacency matrix... And it would be a lot: 875713^2/8 bytes ... ~90 GiB. – fork0 Jul 15 '12 at 12:25
  • You have only about 6 edges per node on average. You could use an array of nodes and a list of edges per node. If the max number of edges per node is small, you could use an array instead of a linked list to store the edges. This will be much smaller and therefore faster than an adjacency matrix. – Mackie Messer Jul 15 '12 at 12:28
  • That's a lot to keep in memory, but if you're on a 64-bit platform, you can use a memory-mapped file. If you're lucky, it will even be sparse (with no space allocated for unused areas) and won't use the 90 GiB. And the operating system will take care of caching for you. – fork0 Jul 15 '12 at 12:34
  • 2
    @fork0 That's like using nuclear ordnance for pest control, while not even being certain it'll kill the cockroaches – user1071136 Jul 15 '12 at 13:15
  • 1
    @user1071136 actually, you're very likely to use mmap already: standard library allocators are known to use mmap on /dev/zero for memory reservation. Besides, people are know to overreact when it comes to pest control – fork0 Jul 15 '12 at 13:24

3 Answers


Your graph is sparse, that is, |E| << |V|^2, so you should probably either use a sparse matrix to represent your adjacency matrix or, equivalently, store for each node a list of its neighbors (which results in a jagged array), like this -

vector<vector<int> > V (number_of_nodes);
// For each cell of V, which is a vector itself, push only the indices of adjacent nodes.
V[0].push_back(2);   // Node number 2 is a neighbor of node number 0
...
V[number_of_nodes-1].push_back(...);

This way, your expected memory requirements are O(|E| + |V|) instead of O(|V|^2), which in your case should be around 50 MB instead of a gazillion MB.

This will also result in a faster Dijkstra (or any other shortest-path algorithm) since you only need to consider the neighbors of a node at each step.

user1071136
  • This is probably the best way. I'll wait some more for other interesting answers. In this way, random edge access takes O(degree) time. But I expect the number to be `<= 5000` (imagine a social network graph). – Hindol Jul 15 '12 at 16:54
  • What do you mean "random edge access"? In what situation do you need it? – user1071136 Jul 15 '12 at 17:16
  • By random I mean indexed access like `edge[src][dst]` where src and dst do not follow a specific pattern. Like edge[200][0] is accessed right after edge[15][29]. I need random access to _recover_ the all-pair-shortest-paths from the _parent_ matrix. In the _parent_ matrix the value of parent[i][j] denotes the node right before j in the shortest path between i and j. – Hindol Jul 15 '12 at 18:17
  • You could simply have another matrix `edge_weights[i][j]` which holds the weight of the edge from `parent[i][j]` to `j`, which is updated together with `parent`. – user1071136 Jul 15 '12 at 19:55

You could store lists of edges per node in a single array. If the number of edges per node is variable you can terminate the lists with a null edge. This will avoid the space overhead for many small lists (or similar data structures). The result could look like this:

enum {
    MAX_NODES = 875713,
    MAX_EDGES = 5105039,
};

int nodes[MAX_NODES+1];         // contains index into array edges[].
                                // index zero is reserved as null node
                                // to terminate lists.

int edges[MAX_EDGES+MAX_NODES]; // contains null terminated lists of edges.
                                // each edge occupies a single entry in the
                                // array. each list ends with a null node.
                                // there are MAX_EDGES entries and MAX_NODES
                                // lists.

[...]

/* find edges for node */
int node, edge, edge_index;
for (edge_index=nodes[node]; edges[edge_index]; edge_index++) {
    edge = edges[edge_index];
    /* do something with edge... */
}

Minimizing the space overhead is very important since you have a huge number of small data structures. The overhead for each list of nodes is just one integer, which is much less than the overhead of e.g. an STL vector. The lists are also laid out contiguously in memory, which means there is no wasted space between any two lists. With variable-sized vectors this will not be the case.

Reading all edges for any given node will be very fast because the edges for any node are stored continuously in memory.

The downside of this data arrangement is that when you initialize the arrays and construct the edge lists, you need to have all the edges for a node at hand. This is not a problem if you get the edges sorted by node, but does not work well if the edges are in random order.

Mackie Messer
  • `"Minimizing the space overhead is very important"` <= Absolutely! I tried the `vector of vectors` approach and it slowed the system down like hell. I had to restart twice because of that. – Hindol Jul 15 '12 at 16:58
  • How do I actually measure the space overhead? I am on Ubuntu. I would like to compare this approach with user1071136's approach. – Hindol Jul 16 '12 at 05:15
  • @Hindol see the updated answer. The total space should be around 26 MB, see the size of the arrays. To measure the overhead of the other solutions you could construct a smaller graph with fewer edges and measure memory consumption... – Mackie Messer Jul 16 '12 at 12:48
  • As far as I know GCC's `vector`, when compiled with `-O3` is identical to an array. Other `vector`s, notably Visual Studio's, are "safe", which means every access to them is checked, slowing everything down. See http://stackoverflow.com/questions/1945777/c-array-vs-vector and also http://stackoverflow.com/a/1945818/1071136 – user1071136 Sep 24 '12 at 13:07
  • @user1071136: The main consideration here is space, not range-checking indices. Every `vector` has a fixed overhead (24 bytes on my computer) and often has a capacity bigger than its size. This can become a problem if you have a lot (like millions) of short vectors. A huge data structure like this graph means that you are probably limited by memory bandwidth. A compact representation will make a difference. – Mackie Messer Sep 25 '12 at 00:52

If we declare a Node as below:

struct Node {
    int node_id;
    vector<int> edges; // all the edges starting from this node
};

Then all the nodes can be expressed as below:

vector<Node> nodes; // std::array would need a compile-time size here
Derui Si