
I'm storing the output of a sensor pod in a Postgres DB. There are many sensor pods (dozens), each generating a reading every 100 ms, 24 hours per day. That works out to 864,000 records per pod per day (10 readings/s × 86,400 s/day). The sensor pod is relatively dumb and lives in an environment of unreliable connectivity, so it accumulates n-line CSV files locally (n TBD) and then ships them all off when it has network access.

The sensor pod knows its own name and the data it produces, but that's about it. I'm trying to decide how to load the data into the DB efficiently. There are two options I'm considering:

  1. Use the `COPY` syntax and give it the file directly
  2. Do a mass insert

Can someone speak to the performance ramifications of both of those?
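
For concreteness, here is roughly what I mean by each, against a hypothetical `readings` table (table, column, and path names are illustrative, not my real schema):

```sql
-- Option 1: COPY reads the CSV directly. A server-side COPY FROM a file path
-- needs the file to be readable by the server process; from a client,
-- psql's \copy or COPY FROM STDIN does the same job.
COPY readings (recorded_at, value)
FROM '/var/spool/pods/pod-17/batch-0001.csv'
WITH (FORMAT csv);

-- Option 2: a mass insert batches many rows into a single statement,
-- paying parse/plan/round-trip overhead once per batch instead of per row.
INSERT INTO readings (pod_name, recorded_at, value) VALUES
    ('pod-17', '2016-06-13 16:42:00.0', 12.3),
    ('pod-17', '2016-06-13 16:42:00.1', 12.4);
    -- ... many more rows per statement
```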

My hesitation with option 1 is that I need to supply some ancillary data (some foreign keys, etc.) that won't be in the file. The only way to do that, without making the sensor pod aware of the random crap that's also going into that table, is to generate temporary tables for the load and then move the rows into their final destination with an `INSERT ... SELECT`, which seems wasteful.
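
A minimal sketch of that staging pattern, again with made-up names (`pod_id = 42` stands in for the ancillary foreign key that the loader knows but the pod doesn't):

```sql
-- Staging table mirrors only the columns the pod actually emits.
CREATE TEMP TABLE staging (
    recorded_at timestamptz,
    value       numeric
);

-- Bulk-load the raw CSV as-is.
COPY staging (recorded_at, value)
FROM '/var/spool/pods/pod-17/batch-0001.csv'
WITH (FORMAT csv);

-- Decorate with the ancillary data on the way into the real table.
INSERT INTO readings (pod_id, recorded_at, value)
SELECT 42, recorded_at, value
FROM staging;
```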

This is a high write / low read environment.

  • I would use a different approach, since your scenario is a `high write / low read environment`. I would use a NoSQL database like Hadoop or MongoDB (if you can call them databases). It is just a suggestion; I don't know enough to answer your specific question. – Jorge Campos Jun 13 '16 at 16:42
  • @JakubKania your unhelpful comment would have been less unhelpful if the question had been "how do I do a bulk insert." My question is specifically about the performance characteristics of two concrete ways to go about my problem. – kolosy Jun 13 '16 at 21:03
  • @kolosy And my answer is: test and see. The topic is widely known and there's even an excellent SO answer here: http://stackoverflow.com/a/12207237/2115135 . Once you test, if neither of the two performs well enough, you can come back with data and then someone will tell you if you did something wrong. And even then the answer might be to throw more iron at it. – Jakub Kania Jun 13 '16 at 21:21
  • @JakubKania that's more helpful, but still not an answer to the question I'm asking. If you don't know the answer, it's entirely ok to move along without commenting. Thank you for the link, though. – kolosy Jun 13 '16 at 21:26
  • I'd prefer option (1). But: have each pod put its data files into a (separate?) spool directory, have the import process aggregate these into a *TEMP* table, do some transformations on the temp table, and insert the result into the final table. [Then: move the spool files into another directory, where they can be deleted, or (on error) into a "reject" directory.] – joop Jun 14 '16 at 10:09
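
A sketch of the per-file load joop describes, as a psql script (names and paths are hypothetical; with `ON_ERROR_STOP`, a failure aborts and rolls back the whole transaction, so the caller can route the spool file to the reject directory):

```sql
-- load_one.sql: run once per spool file, e.g.
--   psql -v ON_ERROR_STOP=1 -f load_one.sql mydb
-- On success the caller moves the CSV to done/; on failure, to reject/.
BEGIN;

CREATE TEMP TABLE staging (
    recorded_at timestamptz,
    value       numeric
) ON COMMIT DROP;

-- \copy runs client-side, so the file only needs to be readable by the loader.
\copy staging FROM 'spool/pod-17/batch-0001.csv' WITH (FORMAT csv)

-- Transform/decorate here, then land the rows in the final table.
INSERT INTO readings (pod_id, recorded_at, value)
SELECT 42, recorded_at, value
FROM staging;

COMMIT;
```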

0 Answers