
I'm trying to create a Java program to clean up and merge rows in my table. The table is large, about 500k rows, and my current solution is running very slowly. The first thing I want to do is simply get an in-memory array of objects representing all the rows of my table. Here is what I'm doing:

  • pick an increment of, say, 1000 rows at a time
  • use JDBC to fetch a ResultSet for the following SQL query: SELECT * FROM TABLE WHERE ID > 0 AND ID < 1000
  • add the resulting data to an in-memory array
  • continue querying all the way up to 500,000 in increments of 1000, adding the results each time.

This is taking way too long. In fact, it's not even getting past the second increment (1000 to 2000). The query takes forever to finish (although when I run the same thing directly through a MySQL browser it's decently fast). It's been a while since I've used JDBC directly. Is there a faster alternative?
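For reference, here's roughly what my current loop looks like (a simplified sketch; it assumes an open java.sql.Connection named con, and the Row holder class and column names are just placeholders):

// sketch of the chunked approach described above
List<Row> rows = new ArrayList<Row>();
int increment = 1000;
for (int start = 0; start < 500000; start += increment) {
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT * FROM TABLE WHERE ID > " + start
        + " AND ID < " + (start + increment));
    while (rs.next()) {
        // Row and the column names here are placeholders
        rows.add(new Row(rs.getInt("ID"), rs.getString("DATA")));
    }
    rs.close();
    stmt.close();
}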

Ish

3 Answers


First of all, are you sure you need the whole table in memory? Maybe you should consider (if possible) selecting only the rows that you want to update/merge/etc. If you really have to have the whole table, you could consider using a scrollable ResultSet. You can create it like this:

// make sure autocommit is off (postgres)
con.setAutoCommit(false);

Statement stmt = con.createStatement(
                   ResultSet.TYPE_SCROLL_INSENSITIVE, //or ResultSet.TYPE_FORWARD_ONLY
                   ResultSet.CONCUR_READ_ONLY);
ResultSet srs = stmt.executeQuery("select * from ...");

It enables you to move to any row you want by using the absolute() and relative() methods.
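For example (a minimal usage sketch; the column name is a placeholder):

// jump to an absolute position, move relative to it, then iterate forward
srs.absolute(1000);              // position the cursor on the 1000th row
srs.relative(-10);               // move back 10 rows from the current position
while (srs.next()) {
    int id = srs.getInt("ID");   // read columns from the current row
    // ... inspect / collect the row ...
}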

pablochan
  • Awesome, that did the trick. To answer your point about selectively getting data: unfortunately I don't know which rows to merge and fix ahead of time -- I have to go through all the rows and inspect them, build appropriate in-memory hash maps, and then go back and clean the table based on certain qualities (see the sketch after these comments). – Ish Jul 03 '09 at 22:14
  • This method is rather fragile. If you have millions of rows and some processing to do, you might hit network lag or a timeout, which will make it difficult to resume the action in certain cases. – user43685 Feb 24 '11 at 16:32
  • Unfortunately this works very slowly on large tables, because the MySQL JDBC driver doesn't support cursors and tries to load all the data into memory. – Dmitriy Tarasov Mar 21 '12 at 15:27
  • I'm using PostgreSQL for the DB, and it didn't help; still getting OOM. – Joey Baruch Oct 11 '15 at 15:37
  • Check if the driver you're using supports this functionality. Think about other possible causes (what's the heap size? how much data are you trying to put in memory?). – pablochan Oct 11 '15 at 15:47
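Following up on the comment above about building in-memory hash maps first, a rough two-pass sketch of that idea (the merge key, column names, and "keep the first row" rule are all hypothetical; it assumes an open Connection named con):

// pass 1: scan every row once and group candidate IDs by some merge key
Map<String, List<Integer>> idsByKey = new HashMap<String, List<Integer>>();
Statement scan = con.createStatement(
                   ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
ResultSet rs = scan.executeQuery("SELECT ID, MERGE_KEY FROM TABLE");
while (rs.next()) {
    String key = rs.getString("MERGE_KEY");   // hypothetical column
    List<Integer> ids = idsByKey.get(key);
    if (ids == null) {
        ids = new ArrayList<Integer>();
        idsByKey.put(key, ids);
    }
    ids.add(rs.getInt("ID"));
}
rs.close();
scan.close();

// pass 2: for every key with duplicates, keep one row and delete the rest
PreparedStatement delete = con.prepareStatement("DELETE FROM TABLE WHERE ID = ?");
for (List<Integer> ids : idsByKey.values()) {
    for (int i = 1; i < ids.size(); i++) {     // keep the first ID for each key
        delete.setInt(1, ids.get(i));
        delete.executeUpdate();
    }
}
delete.close();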

Although it's probably not optimal, your solution seems like it ought to be fine for a one-off database cleanup routine. It shouldn't take that long to run a query like that and get the results (I'm assuming that since it's a one-off, a couple of seconds would be fine). Possible problems:

  • Is your network (or at least your connection to MySQL) very slow? If so, you could try running the process locally on the MySQL box, or somewhere better connected.

  • Is there something in the table structure that's causing it? Pulling down 10k of data for every row? 200 fields? Calculating the ID values to fetch based on a non-indexed column? You could try finding a more DB-friendly way of pulling the data (e.g. just the columns you need, having the DB aggregate values, etc. -- see the sketch below).

If you're not getting through the second increment, something is really wrong - efficient or not, you shouldn't have any problem dumping 2,000 or even 20,000 rows into memory on a running JVM. Maybe you're storing the data redundantly or extremely inefficiently?
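As a concrete example of the "only the columns you need" suggestion, a sketch along these lines (the column names are made up; it assumes an open Connection named con):

// fetch only the columns the cleanup logic actually needs,
// using an indexed ID range via a PreparedStatement
PreparedStatement ps = con.prepareStatement(
    "SELECT ID, STATUS FROM TABLE WHERE ID > ? AND ID < ?");  // hypothetical columns
ps.setInt(1, 0);
ps.setInt(2, 1000);
ResultSet rs = ps.executeQuery();
while (rs.next()) {
    int id = rs.getInt("ID");
    String status = rs.getString("STATUS");
    // ... build whatever lightweight in-memory structure you need ...
}
rs.close();
ps.close();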

Steve B.
  • Thanks for the suggestions. I believe the main problem was that I wasn't using the JDBC API in an optimal way. I am able to fetch my data relatively quickly right now in 10k-20k increments. Good suggestion, though, on only pulling the necessary columns instead of doing a SELECT *. – Ish Jul 06 '09 at 20:31

One thing that helped me was Statement.setFetchSize(Integer.MIN_VALUE). I got this idea from Jason's blog. This cut down execution time by more than half. Memory consumed went down dramatically (as only one row is read at a time).

This trick doesn't work for PreparedStatement, though.
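For reference, the usual way this streaming setup looks with the MySQL Connector/J driver (a sketch; the query and connection variable are placeholders):

// a forward-only, read-only statement plus this fetch-size hint tells
// Connector/J to stream rows one at a time instead of buffering them all
Statement stmt = con.createStatement(
                   ResultSet.TYPE_FORWARD_ONLY,
                   ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
ResultSet rs = stmt.executeQuery("SELECT * FROM TABLE");
while (rs.next()) {
    // process one row at a time; keep only what you need in memory
}
rs.close();
stmt.close();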

Shashikant Kore