0

We have a large table of approximately 1 million rows, and a data file with millions of rows. We need to regularly merge a subset of the data in the text file into a database table.

The main reason for it being slow is that the data in the file has references to other JPA objects, meaning the other jpa objects need to be read back for each row in the file. ie Imagine we have 100,000 people, and 1,000,000 asset objects

Person object --> Asset list

Our application currently uses pure JPA for all of its data manipulation requirements. Is there an efficient way to do this using JPA/ORM methodologies or am I going to need to revert back to pure SQL and vendor specific commands?

  • Can you give some more details about your merging process/requirements? Is your concern about JPA related to performance or just 'how can I do this'? – Jim Tough Dec 02 '10 at 23:29
  • Yes the question is just "how can I do this". Obviously we could just read the file line by line, and attempt to insert any rows that do not already exist. This takes approximately 9 hours using an oracle database over a fairly fast LAN connection. – James Brown Dec 02 '10 at 23:58

2 Answers2

1

why doesnt use age old technique: divide and conquer? Split the file into small chunks and then have parallel processes work on these small files concurrently.

And use batch inserts/updates that are offered by JPA and Hibernate. more details here

The ideal way in my opinion though is to use batch support provided by plain JDBC and then commit at regular intervals.

You might also wants to look at spring batch as it provided split/parallelization/iterating through files etc out of box. I have used all of these successfully for an application of considerable size.

Community
  • 1
  • 1
Aravind Yarram
  • 74,434
  • 44
  • 210
  • 298
0

One possible answer which is painfully slow is to do the following

  • For each line in the file:

    • Read data line
    • fetch reference object
    • check if data is attached to reference object
    • if not add data to reference object and persist

So slow it is not worth considering.