
I have a problem with the execution speed of Titan queries.

To be more specific: I created a property file for my graph using BerkeleyJE, which looks like this:

storage.backend=berkeleyje
storage.directory=/finalGraph_script/graph

Afterwards, I ran gremlin.bat and opened my graph.
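For reference, this is roughly how the graph is opened in the Gremlin console (a sketch; the exact path to the property file is an assumption):

// path to the property file is an assumption
g = TitanFactory.open('/finalGraph_script/graph.properties')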

I set up all the necessary index keys for my nodes:

// open a management transaction to define the schema
m = g.getManagementSystem()
username = m.makePropertyKey('username').dataType(String.class).make()
// composite index for exact-match lookups on 'username'
m.buildIndex('byUsername',Vertex.class).addKey(username).unique().buildCompositeIndex()
m.commit()
g.commit()

(all other keys are created the same way...)
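For example, the imageLink key would look like this (a sketch following the same pattern; the index name and the unique() constraint are my assumptions):

// 'byImageLink' and unique() are assumed here, mirroring the username index
imageLink = m.makePropertyKey('imageLink').dataType(String.class).make()
m.buildIndex('byImageLink',Vertex.class).addKey(imageLink).unique().buildCompositeIndex()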

I imported a CSV file containing about 100,000 lines; each line produces at least 2 nodes and some edges. All of this is done via batch loading, and it works without a problem.

Then I execute a groupBy query that looks like this:

m = g.V.has("imageLink").groupBy{it.imageLink}{it.in("is_on_image").out("is_species")}{it._().species.groupCount().cap.next()}.cap.next()

With this query I want, for every node with the property key "imageLink", the count of the different "species". "Species" are also nodes and can be reached by going back along the "is_on_image" edge and then following the "is_species" edge. This works like a charm for the nodes loaded so far; the query takes about 2 minutes on my local PC.
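For a single image vertex, the same count can be checked like this (a sketch; 'http://imageLink' stands in for one concrete value from the dataset):

// look up one image vertex by its indexed property, then count species
v = g.V.has('imageLink', 'http://imageLink').next()
v.in('is_on_image').out('is_species').species.groupCount().cap.next()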

But now to the problem. My whole dataset is a CSV with 10 million entries. The structure is the same as above, and each line also creates at least 2 nodes and some edges.

On my local PC I can't even import this set; it dies with a memory exception after 3 days of loading.

So I tried the same on a server with much more RAM and storage. There the import works and takes about 1 day. But the groupBy fails after about 3 days. I actually don't know whether the groupBy itself fails or just the connection to the server after such a long time.

So my first question: in my opinion, about 15 million nodes shouldn't be that big a deal for a graph database, should it?

Second question: Is it normal that it takes so long? Or is there any way to speed it up using indices? I configured the indices as listed above :(

I don't know exactly which information you need to help me, so please just tell me what you need in addition to the above.

Thanks a lot! Best regards, Ricardo

EDIT 1: This is how I'm loading the CSV into the graph. I'm using the code below; I deleted some unnecessary properties, which are also set as properties on some nodes and loaded the same way.

bg = new BatchGraph(g, VertexIDType.STRING, 10000)
new File("annotation_nodes_wNothing.csv").eachLine { final String line ->
    def (annotationId, species, username, imageLink) = line.split('\t')*.trim()
    // reuse existing vertices where possible, otherwise create them
    def userVertex = bg.getVertex(username) ?: bg.addVertex(username)
    def imageVertex = bg.getVertex(imageLink) ?: bg.addVertex(imageLink)
    def speciesVertex = bg.getVertex(species) ?: bg.addVertex(species)
    def annotationVertex = bg.getVertex(annotationId) ?: bg.addVertex(annotationId)
    userVertex.setProperty("username", username)
    imageVertex.setProperty("imageLink", imageLink)
    speciesVertex.setProperty("species", species)
    annotationVertex.setProperty("annotationId", annotationId)
    // wire up the edges for this line
    bg.addEdge(null, userVertex, annotationVertex, "classifies")
    bg.addEdge(null, annotationVertex, imageVertex, "is_on_image")
    bg.addEdge(null, annotationVertex, speciesVertex, "is_species")
}
bg.commit()
g.commit()
  • The `groupBy` part can probably be optimized, but the main problem is `g.V.has("imageLink")` - this requires a full graph scan. Can you modify this part so that it uses an index? – Daniel Kuppitz Mar 05 '15 at 08:17
  • Just an observation, but it really shouldn't take 1 day to load 15 million vertices. You don't say much about how you are doing that loading, but if you aren't using `BatchGraph` to load the data, you might be losing a simple optimization there. https://github.com/tinkerpop/blueprints/wiki/Batch-Implementation – stephen mallette Mar 05 '15 at 12:45
  • Thanks, Daniel, for your response. Yeah, that's the same point I thought of. But how? I can't find any examples where you look for "all nodes with the same property" using an index. It's clear to me that I can use an index with `g.query().has('imageLink', EQUALS, 'abc').vertices()`, but I simply want all nodes which have the imageLink attribute, regardless of its value, and it does not work with `g.query().has('imageLink').vertices()`. (One possible approach is sketched after these comments.) – Ricardo Jacobsthal Mar 05 '15 at 12:49
  • @stephenmallette I edited my question and added at the bottom the exact way I'm importing my CSV using BatchGraph. Hope this helps! I'm also thinking it's taking way too long to import the nodes. – Ricardo Jacobsthal Mar 05 '15 at 12:58
  • That's a pretty simple load script. A larger batch size with as much -Xmx as you can spare should help. If you understand how your data is distributed, pre-sorting it to maximize re-use of the cache will help speed things up. – stephen mallette Mar 06 '15 at 13:12
  • What do you mean by "as much -Xmx as you can spare"? And I know how my data is distributed. I have all the data in a MongoDB collection, and I'm transforming a CSV out of it. Sorting in which way? I'm sorry for my "dumb" questions, but I'm actually totally new to graph databases... – Ricardo Jacobsthal Mar 06 '15 at 17:28
  • By "-Xmx" - i was referring to the memory settings on your JVM. http://stackoverflow.com/q/14763079/1831717 and by "pre-sorting" I meant https://github.com/tinkerpop/blueprints/wiki/Batch-Implementation#presorting-data – stephen mallette Mar 07 '15 at 12:20
  • Ah okay, thanks. But since it's a server from my university, I have no rights to change Java/JVM settings. One line of my CSV looks like this: `9 50c682d49177d00646000092 1 zebra ricardo http://imageLink`. From this, one "annotation" node is created by id (50c...), one node for the species (zebra), one for the user (ricardo) and one for the image (imageLink). All of these nodes are connected via edges which are also created in the same step. So in which way does presorting make sense in this case? I know the referenced Blueprints wiki, but can't figure out what's good for my use case. ;-) – Ricardo Jacobsthal Mar 07 '15 at 16:43
  • @stephenmallette I figured out that using the configuration option storage.batch-loading=true makes the import significantly faster. But with this option enabled, the data is not saved persistently to the backend. I have asked another question about exactly this problem here: http://stackoverflow.com/questions/28911801/titan-batchloading-berkeleydb-not-persistent Maybe you know an answer to this as well. Thanks a lot, Stephen! – Ricardo Jacobsthal Mar 07 '15 at 16:45
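
To illustrate Daniel Kuppitz's point about the full graph scan, one possible workaround is sketched below. The `type` property key and the `byType` index are hypothetical additions, not part of the original schema: by tagging every image vertex with a marker value, the existence check on `imageLink` becomes an indexed equality lookup.

// 'type' and 'byType' are hypothetical; define them alongside the other keys
m = g.getManagementSystem()
typeKey = m.makePropertyKey('type').dataType(String.class).make()
m.buildIndex('byType',Vertex.class).addKey(typeKey).buildCompositeIndex()
m.commit()

// while loading, tag every image vertex with the marker value
imageVertex.setProperty('type', 'image')

// the groupBy then starts from an indexed lookup instead of a full scan
result = g.V.has('type', 'image').groupBy{it.imageLink}{it.in("is_on_image").out("is_species")}{it._().species.groupCount().cap.next()}.cap.next()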
