CopyManager cp = ((PGConnection) connection).getCopyAPI(); // from the driver's PGConnection

// loop until 1 million rows have been persisted, in chunks of 10k:
StringBuilder sb = new StringBuilder(); // build 10k comma-separated CSV rows
// ... append rows to sb ...
cp.copyIn("COPY tableName FROM STDIN WITH CSV", new StringReader(sb.toString()));

The copyIn itself takes about 30 seconds. Would you consider this reasonable performance? Is there anything I could do to optimize it?

Reason: I have to batch-insert a few hundred million rows a few times a day. At this rate, importing 500 million rows would take about 4 hours!

How could I increase the speed? Switch to a different DB? Create multiple threads and import in parallel?
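To make the parallelism idea concrete, here is a rough sketch of what I have in mind, assuming each worker thread opens its own JDBC connection and pushes its own pre-built 10k-row CSV chunk (the connection URL, thread count, table name and chunk list are placeholders):

```java
import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class ParallelCopy {

    // Each worker uses its own connection, so the COPY streams can run concurrently.
    static void copyChunk(String url, String csvChunk) throws Exception {
        try (Connection conn = DriverManager.getConnection(url)) {
            CopyManager cp = conn.unwrap(PGConnection.class).getCopyAPI();
            cp.copyIn("COPY tableName FROM STDIN WITH CSV", new StringReader(csvChunk));
        }
    }

    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost/mydb?user=me&password=secret"; // placeholder
        List<String> chunks = Arrays.asList(/* pre-built 10k-row CSV strings */);

        ExecutorService pool = Executors.newFixedThreadPool(4); // e.g. 4 parallel COPY streams
        for (String chunk : chunks) {
            pool.submit(() -> { copyChunk(url, chunk); return null; });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

Whether this actually helps presumably depends on whether the server's disk is already saturated by a single COPY stream.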

  • Why don't you feed the lines from within the loop directly into the `copyIn()` method? Building everything in memory seems rather complicated. Or put the 1 million rows into a file and then feed the rows directly from the file to the `copyIn()` method. – a_horse_with_no_name Apr 21 '15 at 19:47
  • I have a `csv` file that I have to read out, split the lines, take out unwanted content, and then persist the rest using the copy command. That's why I build the rows in memory, chunked to e.g. 10k, and send each chunk in bulk with copy. – membersound Apr 21 '15 at 19:50
  • Depending on how many rows you remove from the file, it's probably faster to insert them all into a staging table using the copy command and then move only the "wanted" rows into the target table using `insert ... select ...` - but I'd look into some kind of filtered Reader that you pass into the `copyIn()` method and then filter the rows while sending them to the server (roughly like the sketch after these comments). – a_horse_with_no_name Apr 21 '15 at 20:25
  • What if I also have to transform some of the columns? Filtering + transformation is the main reason why I build the resulting CSV rows beforehand using `StringBuilder`. – membersound Apr 21 '15 at 20:53
  • 10k row batches is big enough that there's not going to be much benefit from going bigger. That seems fine. Parallelism may well help, yes. Really, you need to *analyse why this is slow* - look for bottlenecks. Is Java using 100% CPU so it can't keep up with PostgreSQL's writes? Is PostgreSQL saturating the disk with writes? Is PostgreSQL doing lots of index lookups (foreign keys) that're slowing things down? Are there unnecessary indexes on the table(s)? etc. – Craig Ringer Apr 22 '15 at 00:23
  • See also http://stackoverflow.com/q/12206600/398670 – Craig Ringer Apr 22 '15 at 00:24
  • I'm inserting into a temp table that has no indexes, foreign keys, or primary key. CPU is at about 30%, and my bottleneck really seems to be the postgres `copyIn`. How could I find out whether "PostgreSQL is saturating the disk with writes"? I tried executing the copyIn in parallel from multiple threads, also using a connection-pooling driver, but that did not change anything. Does the copyIn command maybe not support parallelism, so that any copyIn commands sent to the DB are executed serially by the DB automatically? – membersound Apr 22 '15 at 06:08
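A rough sketch of the filtered-Reader idea from the comments above: wrap the CSV file in a `Reader` that drops unwanted lines and transforms the remaining ones on the fly, then hand that Reader straight to `copyIn()`, so no large `StringBuilder` chunks are needed. The `filterAndTransform()` method, the file name, the connection URL and the table name are placeholders for the actual filtering/transformation logic:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

/** Streams a CSV file line by line, filtering and transforming rows on the fly. */
public class FilteringCsvReader extends Reader {
    private final BufferedReader source;
    private String pending = ""; // current transformed line (with trailing newline)
    private int pos = 0;

    public FilteringCsvReader(BufferedReader source) {
        this.source = source;
    }

    // Placeholder: return null to drop a row, or the transformed CSV line to keep it.
    private String filterAndTransform(String line) {
        return line.contains("unwanted") ? null : line;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        // Refill the pending line whenever the previous one has been fully consumed.
        while (pending != null && pos >= pending.length()) {
            String line = source.readLine();
            if (line == null) {
                pending = null; // end of file
            } else {
                String kept = filterAndTransform(line);
                if (kept != null) {
                    pending = kept + "\n";
                    pos = 0;
                }
            }
        }
        if (pending == null) {
            return -1;
        }
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        source.close();
    }

    public static void main(String[] args) throws Exception {
        // Stream the filtered rows straight into COPY without buffering them in memory.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb"); // placeholder
             BufferedReader file = new BufferedReader(new FileReader("input.csv"))) {           // placeholder
            CopyManager cp = conn.unwrap(PGConnection.class).getCopyAPI();
            cp.copyIn("COPY tableName FROM STDIN WITH CSV", new FilteringCsvReader(file));
        }
    }
}
```

Whether this is faster than sending pre-built 10k-row chunks probably depends on where the time is actually spent, but it avoids holding the chunks in memory and keeps a single continuous COPY stream open.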

0 Answers