How to best stage large amounts of data with Hibernate/JPA?

Question

How can I best stage large amounts of data for migration into our database using Hibernate efficiently? Performance when dealing with >25K records that are 100+ columns are not ideal.

Let me explain:

Background

I'm working for a large company that operates around the world. I've been tasked with leading a team (at least for backend) to create a full stack application that allows for various levels of management to perform their tasks. The current tech stack for backend is Java, Spring Boot, Hibernate, and PostgreSQL. Management would like to upload Excel files to our application and have our application parse them so we can refresh the data in our database.

Unfortunately, these files range from 25K to 50K records. We're aware that these Excel files are generated using SQL queries from Excel. However, we are not permitted to access the database with this data directly. The security is very tight and will not permit us access to any APIs, DB calls, etc. to work around Excel. Due to memory constraints and scalability concerns, we're using SAX parsing to keep a low footprint. Once we parse the Excel files, we're mapping them to a Hibernate entity that represents a staging table. Then we're migrating data from it to our other tables.

Currently to stage 25K records and migrate all the data to our other tables takes 15 minutes, which is unacceptable in the eyes of management. Especially, since this will need to be done on a daily basis.

Things I've tried

Enabling batch processing in Hibernate by following Vlad's answer here. This knocked maybe 20 seconds off the overall time for staging.
Rewriting criteria and other queries for fetching data.
Reducing amount of data to process (most fields are required so the amount can't be too heavily reduced).
Indexing important columns in both the staging and destination tables. I'm doing the indexing as part of schema generation.
Optimize parts of code that clean parsed data of imperfections.

I cannot post code due to NDA

Summary of Constraints

This app needs strong support for generating reports on related data (one of the reasons we went with RDBMS. Also, the data fits well into a relational model).
Must maintain a complete audit history of all records (currently using Hibernate Envers).
We have to approve any new dependency/library through the company's cybersecurity team. This can result in days of lost production while we wait for approval. It's not ideal to request new dependencies for the project.
There are no ways of working around the Excel files at this time. An API call or simple database query would be nice, but that's not an option to us for security reasons.
Scalability is a growing concern. Another team under this project has to parse an Excel file of 50K rows with 100 rows. All of this is only data for the USA. The project owner has said the company eventually wants to expand this app's management capabilities abroad.

My Thoughts

Purely regarding the staging issue, I think it's best to get rid of the Hibernate entities responsible for staging. I'll rewrite the migration of staged data into our live tables in SQL using stored procedures. Despite it being vendor-specific (to my knowledge, anyway) I'll use Postgres' COPY command to do the heavy lifting with the large amounts of rows. I can rewrite the parser to direct data to a CSV or other delimited file instead. The only issue I have then is how to migrate the data to tables that use Hibernate sequences and generators. I haven't figured out how to synchronize Hibernate's sequences after a manual update to the database like that. It likes the throw errors about duplicate primary keys until it comes across an ID in the sequence that's not used. But I feel that's another question entirely.

Edit 1:

I should clarify. The 15 minutes is the total time for all of staging. This includes staging and migration. Just the staging of the 25K records takes around 1:30, which also isn't ideal. I've run session metrics a few times and get around the following numbers for Spring Data persisting the 25K records:

2451000 nanoseconds spent acquiring 1 JDBC connection;
0 nanoseconds spent releasing 0 JDBC connections;
96970800 nanoseconds spent preparing 24851 JDBC statements;
9534006000 nanoseconds spent executing 24849 JDBC statements;
21666942900 nanoseconds spent executing 830 JDBC statements;
23513568700 nanoseconds spent executing 2 flushes (flushing a total of 49696 entities and 0 collections)
211588700 nanoseconds spent executing 1 partial-flushes (flushing a total of 24848 entities and 24848 collections)

For this specific case, I'm staging the roughly 25K entities and then using a stored procedure to move only employee data from staging to live tables (a small fraction of what makes up the 15 total minutes). That procedure seems to run instantly. But there's other data that we have to determine via joins, group by statements, etc., which appear to be costly. I'm just not sure why it's taking Spring Data so long to persist that many records when it would take pure SQL significantly less.

Have you tried analyzing where exactly it spends too much time? This can help you identify the places you can optimize. Maybe using the JDBC directly may help you since you are not relying at all on what spring data jpa does. If you have validation on the data you are reading, especially against the database then that would be another optimization point (reducing the trips to database). From your description I am not sure what to suggest since I don't know where the problem is. 25k records in an excel file is not that much to read and process (also a batch with all 25k records is not a DBissue — tzortzik, Feb 07 '21 at 18:11
I appreciate the advice. I've posted a JPA generated statistic for persisting the 25K records and tried to clarify a bit. We perform data parsing and validation in Java atm. Nothing on the database aside from a few casts if needed. So there shouldn't be any unnecessary round trips occurring. I'll keep trying to optimize and give JDBC a shot as well. Honestly, it doesn't make sense to me to have a full-blown Hibernate entity for a staging table. So maybe JDBC or a procedure is the better option. — ElderFuthark, Feb 07 '21 at 21:35
Inserting 25000 rows into an unlogged table with 100 float columns takes 200 milliseconds on a 2009 vintage Core 2 quad. The timing info given in the question only covers less than a minute of the whole 15 minutes, so where does it spend all that time?... Can you break it down into xml parsing and other steps, give more details on what those 25000 queries are doing (if it executes one insert per row, that's a problem) and where are the 14 remaining minutes spent? I don't use Java, but I'm sure it has a competent profiler that can give you a full list of time spent in every function... — bobflux, Feb 07 '21 at 23:47
I mean, maybe it spends all that time parsing xml or something, and you're wasting your time thinking about the database. Or maybe it's a database problem. Without a detailed profile from start to finish of the complete process, impossible to know. Yet, the 24849 JDBC statements hint that it's doing a query per line, and this would be a problem. But it only explains 40 seconds out of the 15 minutes. — bobflux, Feb 07 '21 at 23:51
OK. I've never used Java, but googling for ["hibernate batch inserts"](https://www.baeldung.com/jpa-hibernate-batch-insert-update) reveals that "If our entities use GenerationType.IDENTITY identifier generator, Hibernate will silently disable batch inserts/updates." So if you don't use postgres sequences to generate the primary key (what a weird idea) then maybe that explains why hibernate generates so many queries... please check first what the 25k queries are though. Maybe they're selects for validation. — bobflux, Feb 07 '21 at 23:56
@ElderFuthark If your entity has defined relations then this may be an issue. Hibernate will try to their relations for every single entity and this take a lot of time. Try to remove those relations and see what happens. Usually round trips to-from database are time consuming. — tzortzik, Feb 08 '21 at 05:55
There are tons of questions and answers around this topic, please test those and let us know the results. Just a few examples: https://stackoverflow.com/questions/50772230/how-to-do-bulk-multi-row-inserts-with-jparepository https://stackoverflow.com/questions/20285347/massive-insert-with-jpa-hibernate https://stackoverflow.com/questions/38424001/using-batch-insert-in-hibernate-jpa https://stackoverflow.com/questions/47632045/spring-boot-jpa-bulk-insert https://stackoverflow.com/questions/25949777/batch-insert-with-jpa-and-spring — Jens Schauder, Feb 08 '21 at 07:24

How to best stage large amounts of data with Hibernate/JPA?

0 Answers0