I'm just getting into graph databases, and I seem to keep running into a problem deciding between using an "index node" or an "indexed property" for tracking things like "node type". Since I've no real experience thus far, I don't have any information to base the decision on and both approaches seem to be equally valid.

So, the question is: What are the tradeoffs between the two approaches, and how does scale (i.e., number of nodes) affect the decision?

For a sample scenario, let's assume there are two types of "things": User and Product. The edges between the User nodes and the Product nodes don't matter much; what matters is whether we want type: User and type: Product properties on each node, or whether we want each node to have an edge pointing back at a User node or a Product node, respectively.

Which approach is better under which circumstances?

Note: I'm looking at Neo4j and Titan in particular, but I would think that this will tend to apply more generally as well.

Charles
cdeszaq

3 Answers

First, you need to ask yourself: Does the type of a vertex/node need to be indexed? I.e. do you need to retrieve vertices/nodes by their type, let's say, retrieve all 'user' vertices from the graph or do you need to answer queries that start by retrieving all vertices of a given type and then filter/process those further?

If the answer to this question is yes, then I suggest you store the type as a string property that is indexed. Or, if you are developing in a JVM-based language, you could define a type enum and use that as the property type for more type safety and automatic error checking. Titan supports arbitrary user-defined classes/enums as property types and will compress those for a low memory footprint.

However, the downside of this approach is that it won't scale, because you are building a low-selectivity index. That means there will likely be very many vertices of type 'user' or 'product', and all of them need to be associated with the index entry for 'user' or 'product' respectively. This makes maintaining and querying this index very expensive and hard to scale (imagine Facebook had a 'type' index: the 'photo' entry would have billions of vertices under it). If you are not (yet) concerned with scaling, then this can work.
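To make the low-selectivity point concrete, here is a minimal sketch in Python of an inverted index on a `type` property. It is a toy in-memory model, not how Titan or Neo4j actually store indexes; `PropertyIndex`, `build_type_index`, and the sample data are hypothetical names introduced only for illustration.

```python
from collections import defaultdict


class PropertyIndex:
    """Toy inverted index: property value -> set of node ids (illustrative only)."""

    def __init__(self):
        self.entries = defaultdict(set)

    def add(self, node_id, value):
        # Every node with this value lands in the same index entry.
        self.entries[value].add(node_id)

    def lookup(self, value):
        # .get avoids creating an empty entry for unknown values.
        return self.entries.get(value, set())


def build_type_index(nodes):
    """Index every node by its 'type' property."""
    index = PropertyIndex()
    for node_id, props in nodes.items():
        index.add(node_id, props["type"])
    return index


# Hypothetical data: 1,000 users and 10 products.
nodes = {i: {"type": "user"} for i in range(1000)}
nodes.update({1000 + i: {"type": "product"} for i in range(10)})

type_index = build_type_index(nodes)

# Size of each index entry: it grows with the number of nodes of that type.
entry_sizes = {value: len(ids) for value, ids in type_index.entries.items()}
```

With only two distinct values, each entry holds a large share of the graph, so the cost of a lookup and of keeping the entry up to date grows with the number of nodes of that type rather than staying small.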

If the answer to the question is no, then I suggest modeling types as vertices/nodes in the graph, i.e., have a 'user' vertex and a 'product' vertex, and an edge labeled 'type' from each user to the 'user' vertex, etc.

The advantage of this approach is that you use the graph to model your data rather than having string values outside of your database represent crucial type information. As you build your application, the graph database will become its central component and last for a long time. As programming languages and developers come and go, you don't want data modeling and type information to go with them and be faced with the question: "What does SPECIAL_USER mean?" Rather, have a SPECIAL_USER vertex and add provenance information to it, i.e., who created this type, what does it represent and a short description - all in the database.

One problem with this approach is that the 'user' and 'product' vertices will have a lot of edges incident on them as your application scales. In other words, you are creating supernodes, which cause scaling issues. This is why Titan introduced the concept of a unidirectional edge. A unidirectional edge is like a link on the web: the starting vertex points to another vertex, but that vertex is unaware of the edge. Since you don't want to traverse from the 'user' vertex to all user vertices, you aren't losing anything, but you gain scalability and performance.
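The idea of a unidirectional 'type' edge can be sketched with a toy adjacency-list graph in Python. This is an assumption-laden illustration, not Titan's storage layout or API: the `Graph` class, its `unidirectional` flag, and the vertex names are all hypothetical.

```python
from collections import defaultdict


class Graph:
    """Toy adjacency-list graph (illustrative only, not Titan's data model)."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # vertex -> [(label, target)]
        self.in_edges = defaultdict(list)   # vertex -> [(label, source)]

    def add_edge(self, source, label, target, unidirectional=False):
        self.out_edges[source].append((label, target))
        if not unidirectional:
            # Only ordinary edges are recorded on the target vertex as well.
            self.in_edges[target].append((label, source))

    def type_of(self, vertex):
        """Follow the outgoing 'type' edge, if there is one."""
        for label, target in self.out_edges.get(vertex, []):
            if label == "type":
                return target
        return None


g = Graph()
for i in range(1000):
    # The type vertex 'USER' never learns about these edges.
    g.add_edge(f"u{i}", "type", "USER", unidirectional=True)
g.add_edge("p0", "type", "PRODUCT", unidirectional=True)
g.add_edge("u0", "bought", "p0")  # ordinary edge, visible from both ends

user_type = g.type_of("u0")
incoming_on_user_type = len(g.in_edges.get("USER", []))
```

In this model the 'USER' vertex carries no adjacency entries no matter how many users exist, which is the point of the web-link analogy. The price is the one raised in the comments below: you cannot start at 'USER' and enumerate its users.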

  • So, in summary, using an indexed property allows working with all nodes-of-a-type more easily at the expense of scalability, while using index nodes is a more natural (i.e., structural) representation at the expense of scalability. Does one way or the other substantially limit how the graph can be used? – cdeszaq Nov 20 '12 at 16:28
  • I am not sure I am following you. If one follows your recommendation to use the second solution (an edge to specify the type), how does one find the list of users, given that one cannot traverse from the user vertex to the list of users? – Alexandre Blanquart Mar 11 '13 at 09:19
  • Thanks for the great answer Mathias. But could you talk a bit more about /not/ using unidirectional edges to the 'user' node? What the implications of that would be, in terms of querying in each direction? (I'm guessing none, and huge impact, unless you add index properties to the edge). And also with unidirectional edges, with Cassandra, you're limiting yourself to < 10's millions of nodes of 'user' type for example. – Antony Stubbs Aug 02 '13 at 15:18
  • Sorry, when I say /not/ using unidirectional, I meant, using bidirectional instead (as opposed to no edge). And also, given Titan's duplication of adjacency list information at edge vertex, does a bidirectional edge change the underlying data structure at all? Or is it just a synthetic query restriction? – Antony Stubbs Aug 02 '13 at 19:48

What kind of query do you want to ask? In Neo4j, you would create a User and a Product index or even combine them in one, and then be able to ask things like

start bob = node:User(name='Bob') match ....

and even fulltext search. To check easily whether a node is a User or a Product, you could still keep the property on the nodes, just for convenient and fast traversal. If you are not traversing from User/Product to the instance nodes (you do the index lookups for that), you can even do the check by having a PRODUCT or USER relationship back to the type (super)nodes, giving you an in-traversal check like

start s = node:User(name='Bob') match s-[r]-(product)-[typeRel:PRODUCT]->() return product 

HTH

Peter Neubauer

A very important reason for indexing has been missed here imo. Suppose you have a complex graph with many different properties, and many different node types, and you want to match a pattern with a "person" who has a bunch of properties.

With no indexes, you have no option but to traverse the graph, a graph in which maybe only 0.01% of the nodes are of type person. And traversals may not reach unconnected regions of the graph.

Instead, if I have indexed person, I just iterate through every person, and search locally around each person to see if their pattern matches.

You should be able to see instantly that the cost of the first approach scales with the total size of the graph, while the cost of the second scales only with the total number of people in the graph.
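A rough sketch of that difference in Python, simplifying "traverse the graph" to a scan over all nodes and counting how many nodes each strategy touches. The function names, the `person_ids` list standing in for an index, and the sample data are hypothetical and only meant to illustrate the scaling argument.

```python
def match_by_scan(nodes, predicate):
    """Visit every node in the graph and keep the matching persons."""
    visited = 0
    matches = []
    for node_id, props in nodes.items():
        visited += 1
        if props["type"] == "person" and predicate(props):
            matches.append(node_id)
    return matches, visited


def match_by_index(nodes, person_ids, predicate):
    """Visit only the nodes listed in the (hypothetical) person index."""
    visited = 0
    matches = []
    for node_id in person_ids:
        visited += 1
        if predicate(nodes[node_id]):
            matches.append(node_id)
    return matches, visited


# Hypothetical graph: 10,000 nodes, only two of which are persons.
nodes = {i: {"type": "other"} for i in range(10000)}
nodes[42] = {"type": "person", "name": "Peter"}
nodes[4242] = {"type": "person", "name": "Paula"}
person_ids = [42, 4242]  # assumed to be maintained as persons are written


def is_peter(props):
    return props.get("name") == "Peter"


scan_matches, scan_visited = match_by_scan(nodes, is_peter)
index_matches, index_visited = match_by_index(nodes, person_ids, is_peter)
```

Both strategies find the same node, but the scan touches all 10,000 nodes while the index-driven version touches only the two persons.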

Moral: If you envisage a use case where there will be many searches of the graph with a particular type of node as the bound node in your pattern (e.g., here lots of searches for "people with pattern X"), then you should index these nodes for improved search performance.

If you are going to search for things like "all nodes within two links of person Peter", then indexing person by their name would be critical, and would allow constant time performance regardless of graph size - as you are essentially looking up the location of Peter in a hash table.

phil_20686
  • The question is focused on discrete properties, like `nodeType`, rather than freeform properties, like `personName`, so your point about finding Peter's node doesn't factor into this question. But it is a good distinction to make, since about the only sensible way to index a freeform property is with a traditional index. – cdeszaq Jan 28 '14 at 17:24
  • Also, the question is exactly about how to handle the .01% of your nodes. The question is "at what point do index-nodes make more / less sense than a traditional index for discrete fields, like nodeType". The question isn't about the _reason_ for indexing, it's about _what method of indexing works better, when and why?_ – cdeszaq Jan 28 '14 at 17:26