
I have 26 CSV files that I want to grab from the internet on a nightly basis and load into a PostgreSQL table. I have this working using Java, PreparedStatement, and batching, but performance is painfully slow: it takes 30 minutes to grab the 6,000 or so rows and insert them into PostgreSQL. This is my first time doing something like this, so I don't have a reference point for whether that is fast or slow.

To fetch each file, I am using this code:

URL grabberUrl = new URL(csvUrl);
URLConnection grabberConn = grabberUrl.openConnection();
BufferedReader grabberReader = new BufferedReader(new InputStreamReader(grabberConn.getInputStream())); 

I am then creating a PreparedStatement, taking values from the input stream, and setting them:

con = DriverManager.getConnection(url, user, password);
pst = con.prepareStatement("insert into blah(name, year) values(?, ?)");
pst.setString(1, name);
pst.setString(2, year);

I am then batching up the inserts. I've tried batch sizes from 100 to 1000 with no meaningful change in performance:

pst.addBatch();
count++;

if (count == 100) {
    count = 0;
    pst.executeBatch();
}

Has anyone got any suggestions as to what I can do to make things faster?

user2254180
  • Are you sure the slow part is not the first part of the process, i.e. getting the files from the internet? –  Aug 14 '14 at 22:34
  • Is autoCommit enabled? Do you commit after each batch or at the end? – MadProgrammer Aug 14 '14 at 22:38
  • Definitely not, I've set a timer on grabbing the files, the very biggest files take less than 5 seconds to download. – user2254180 Aug 14 '14 at 22:38
  • autoCommit is enabled – user2254180 Aug 14 '14 at 22:39
  • This is awfully slow. Make sure the slowness is in the database and not in the download. Try executing 6000 insert queries from psql and see if that's slow as well (which would indicate a database, disk, or network problem). If the problem is really in the Java code, then post it here (not just snippets). – JB Nizet Aug 14 '14 at 22:39
  • Disable autocommit and use a single transaction to insert everything (sketched after these comments). For the record, I insert more than 100,000 rows, without even using batches, in 20 seconds. – JB Nizet Aug 14 '14 at 22:40
  • See http://stackoverflow.com/questions/12206600/how-to-speed-up-insertion-performance-in-postgresql and http://stackoverflow.com/q/758945/398670. Specifically for Java, use the `CopyManager`. – Craig Ringer Aug 15 '14 at 00:49
  • +1 for postgres COPY – Konstantin V. Salikhov Aug 15 '14 at 06:35
  • Thank you for the suggestion to disable autoCommit: that appears to have been the primary issue. With that off and no batching at all, my application took 3 minutes 30 seconds to run; with batching it went down to just under 2 minutes. I should also have mentioned my PostgreSQL instance is an Amazon Micro instance, so that probably explains why it takes 2 minutes to run and not 30 seconds. – user2254180 Aug 15 '14 at 13:40
  • Have you tried PG's COPY command to see how quickly it can run once you have the data? http://www.postgresql.org/docs/9.2/static/sql-copy.html Also, can you measure how long it takes to grab the data WITHOUT sending it to the db, to get an idea of how long the data read takes? I had a similar problem (not in Java) and ended up writing one insert command with many sets of values (batching that way) rather than many inserts with one set of values... Nick. – nickL Aug 19 '14 at 21:55
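
A minimal sketch of the fix the comments converge on: disable autocommit so the whole load runs in one transaction, and keep the batching. The url, user, password, and rows variables are placeholders; the table and columns come from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

Connection con = DriverManager.getConnection(url, user, password);
con.setAutoCommit(false); // one transaction for the whole load

PreparedStatement pst = con.prepareStatement("insert into blah(name, year) values(?, ?)");
int count = 0;
for (String[] row : rows) { // rows parsed from the CSV (placeholder)
    pst.setString(1, row[0]);
    pst.setString(2, row[1]);
    pst.addBatch();
    if (++count % 100 == 0) {
        pst.executeBatch(); // send a batch every 100 rows
    }
}
pst.executeBatch(); // flush the final partial batch
con.commit();       // single commit instead of one per row
pst.close();
con.close();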

2 Answers


If you can access the files from the PostgreSQL server, try using the COPY statement: http://www.postgresql.org/docs/9.3/static/sql-copy.html

Also, if you trust the data quality, you can temporarily remove any table constraints and drop any indexes, then add the constraints and indexes back after loading the data.
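
A rough sketch of that approach over JDBC. The index name blah_name_idx and the server-side file path are hypothetical, and note that server-side COPY reads from the database server's filesystem, which typically requires superuser privileges:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

Connection con = DriverManager.getConnection(url, user, password);
Statement st = con.createStatement();

st.execute("drop index if exists blah_name_idx"); // hypothetical index
// The path below must be visible to the PostgreSQL server, not the client.
st.execute("copy blah(name, year) from '/path/on/server/file.csv' with (format csv)");
st.execute("create index blah_name_idx on blah(name)"); // rebuild after the load

st.close();
con.close();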

Tim Child

Try the following:

import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// With a pooled connection, use con.unwrap(PGConnection.class)
// instead of the plain cast.
PGConnection con = (PGConnection) DriverManager.getConnection(...);

CopyManager copyManager = con.getCopyAPI();

// Stream the downloaded CSV reader straight into the table via COPY.
copyManager.copyIn("copy mytable from stdin with (format csv)", grabberReader);

If mytable is heavily indexed, then drop the indexes, load, and recreate the indexes.
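
For the question's actual scenario (downloading the files and loading them in one pass), here is a hedged end-to-end sketch; csvUrls, url, user, and password are placeholders, and mytable comes from this answer:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

PGConnection con = (PGConnection) DriverManager.getConnection(url, user, password);
CopyManager copyManager = con.getCopyAPI();

for (String csvUrl : csvUrls) { // the 26 nightly URLs (placeholder)
    URL grabberUrl = new URL(csvUrl);
    try (BufferedReader grabberReader = new BufferedReader(
            new InputStreamReader(grabberUrl.openConnection().getInputStream()))) {
        // Each file streams straight into Postgres; add HEADER true to the
        // options if the files have a header row.
        copyManager.copyIn("copy mytable from stdin with (format csv)", grabberReader);
    }
}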

Neil McGuigan