
Eventually I have to load 35GB of data into an ArangoDB instance.
So far I've tried the following approaches to load just 5GB of it (and failed):

  • Loading via Gremlin. It worked, but it took something like 3 days; this is not an option.

  • The bulk import API endpoint, but I got the following error:
    ...[1] WARNING maximal body size is 536870912, request body size is -2032123904

  • The arangoimp command, but I ended up with two different errors:

    • With no --batch-size or a small one, it fails with
      import file is too big. please increase the value of --batch-size
    • With a bigger --batch-size it returns the same error as the bulk import API.

Could someone tell me how to fix those commands, or suggest another way to actually load this data?

Thanks

Edit for @DavidThomas, here are the specs:
- RAM: 128G
- CPU: 2x Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz
- OS: Linux (ubuntu) sneezy 3.13.0-86-generic
- HDD: classic (non-SSD)

  • What were the specs of your ArangoDB Server? RAM, HDD, CPU, OS. I know RAM is important. I'm also interested for my work in Arango. Cheers, – David Thomas Jun 12 '16 at 04:44
  • Thanks for the stats. I've done imports, but used a node.js app to open a stream reader on the import file (which was in CSV or JSON format) and then just push the records in (using .createReadStream from the fs package). Turning off WaitForSync can speed it up, but that may raise other issues. I'm interested to see Arango support answer this. – David Thomas Jun 14 '16 at 04:15

1 Answer


I hope you're not using ArangoDB 2.4 as in your link to ArangoImp? ;-)

For our performance blog post series we imported the Pokec dataset using arangoimp. The maximum POST body size of the server is 512MB (536,870,912 bytes, which is exactly the limit reported in your warning).

For performance reasons, arangoimp doesn't parse the JSON; instead it relies on each line of your import file being one document to send, so that it can easily chop the file into chunks of valid JSON.

It therefore can't chunk JSON dumps that come as one big array, like this:

[
{ "name" : { "first" : "John", "last" : "Connor" }, "active" : true, "age" : 25, "likes" : [ "swimming"] },
{ "name" : { "first" : "Lisa", "last" : "Jones" }, "dob" : "1981-04-09", "likes" : [ "running" ] }
]

and will thus attempt to send the whole file at once. If that exceeds your specified --batch-size, you will get the "import file is too big" error message.

However, if your file contains one document per line:

{ "name" : { "first" : "John", "last" : "Connor" }, "active" : true, "age" : 25, "likes" : [ "swimming"] }
{ "name" : { "first" : "Lisa", "last" : "Jones" }, "dob" : "1981-04-09", "likes" : [ "running" ] }

it can chunk the upload per line according to --batch-size, down to a minimum batch size of 32kb.

You therefore need to prepare your dump along the guidelines above in order to use arangoimp.
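
If your dump is one big JSON array like the first example, a small streaming converter can rewrite it into the one-document-per-line form. Here is a minimal sketch in Python, assuming the third-party ijson package and placeholder file names:

# Minimal sketch: rewrite a JSON array dump into one document per line.
# Assumes the dump is a single top-level JSON array; file names are placeholders.
# Uses the third-party ijson package (pip install ijson), which parses the input
# as a stream instead of loading the whole multi-GB file into memory.
import json
import ijson

with open("dump.json", "rb") as src, open("dump.jsonl", "w") as dst:
    # "item" selects each element of the top-level array, one at a time
    for doc in ijson.items(src, "item"):
        # ijson may yield Decimal for non-integer numbers; fall back to float
        dst.write(json.dumps(doc, default=float) + "\n")

The resulting file can then be fed to arangoimp with a reasonable --batch-size.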

Since arangoimp also uses the import API, it has the same limitations as using the API directly. You need to write a small program that uses a stream-enabled JSON parser and translates the output into one document per line, as sketched above. You can then either send the chunks to the server directly from your own script, or let arangoimp handle the chunking for you.
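
If you prefer to send the chunks yourself, a rough sketch of such a script against the bulk import endpoint could look like the following (server address, credentials, collection name and chunk size are placeholder assumptions; it uses Python's requests package):

# Rough sketch: POST a one-document-per-line file to /_api/import in chunks.
# Endpoint, credentials, collection name and chunk size are placeholders.
import requests

ENDPOINT = "http://localhost:8529/_api/import?type=documents&collection=mycollection"
LINES_PER_REQUEST = 10000  # keep each request body well below the 512MB limit

def send(batch):
    # type=documents expects the request body to be one JSON document per line
    response = requests.post(ENDPOINT, data="\n".join(batch), auth=("root", ""))
    response.raise_for_status()

batch = []
with open("dump.jsonl") as src:
    for line in src:
        line = line.strip()
        if line:
            batch.append(line)
        if len(batch) == LINES_PER_REQUEST:
            send(batch)
            batch = []
if batch:
    send(batch)

Either way the idea is the same: the server only ever sees one batch of line-separated documents at a time, so no single request comes anywhere near the 512MB body limit.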
