
I am currently developing a web application that will connect to a PostgreSQL database. I am using Java as my language of choice, and Maven as my build manager (web application template).

My data is contained in many pipe-delimited files, with each row being an entity in the table (one file for each table). There are 20-30 tables, with the total number of rows across them reaching 10+ million.

I am aware there is a CSV importer within psql, and I have used it successfully in other cases. However, I am still deciding which technology is the better fit for moving forward with my application.

On the surface I know the differences, advantages, disadvantages, etc. between JDBC and Hibernate, and have read various other answers comparing them. I decided Hibernate would be worth learning and would be a good fit for my application. However, here lies the CSV import problem: I know Hibernate does not do this out of the box, and JDBC would be a better choice for it.

The question is: is JDBC the obvious choice in this scenario? I have read about when it is preferred over Hibernate and vice versa, but my use case seems to overlap both and I can't decide for sure.

Does it make the most sense to just go with JDBC, since I can use CopyManager and COPY ... FROM STDIN to load my CSV data? I was eager to dive into Hibernate as it seems to have solid advantages over JDBC and would be good tech to know as a new developer, but I am new to this and would love some advice!
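From what I've read, the CopyManager approach would look roughly like this (the table name, file path, and connection details below are just placeholders):

```java
import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- substitute your own.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {

            // CopyManager exposes PostgreSQL's COPY protocol through JDBC.
            CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();

            // Stream the pipe-delimited file through COPY ... FROM STDIN
            // instead of issuing one INSERT per row.
            try (Reader reader = new FileReader("/data/my_table.csv")) {
                long rows = copyManager.copyIn(
                        "COPY my_table FROM STDIN WITH (FORMAT csv, DELIMITER '|')",
                        reader);
                System.out.println("Loaded " + rows + " rows");
            }
        }
    }
}
```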

**Edit for clarification:**

So I will be loading all of the CSV data into the database, and then over time I will likely be adding more tables (likely also from CSV) as I decide to. My web app will essentially allow the client to type an SQL query into an input form on the webpage, which will be executed against the database I've created, and the results will be returned. I am using Tomcat as my server.
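To make the query side concrete, this is a bare-bones sketch of what I have in mind with plain JDBC (the connection details and the read-only role are just placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class QueryRunner {

    // Executes a query string supplied by the client and prints each row.
    // Since the whole point is running raw user SQL, the connection should
    // at minimum use a read-only database role.
    public static void runQuery(String sql) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/mydb", "readonly_user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {

            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    row.append(meta.getColumnLabel(i)).append('=')
                       .append(rs.getString(i)).append("  ");
                }
                System.out.println(row);
            }
        }
    }
}
```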

Thank you!

bburc
  • Hibernate is not well suited for batch-like operations, like massive inserts. If all your application needs to do is to import CSV files, I'm not even sure JDBC is the best choice. Why not simple scripts? Hibernate is best suited for typical CRUD operations, potentially including complex use cases where you need to navigate through a graph of interconnected entities. – JB Nizet Apr 12 '16 at 20:45
  • Further to what @JBNizet said, what is your "web application" actually going to *do*? Will it have to import those text files on a regular basis? Why? To what end? Please [edit] your question to clarify. – Gord Thompson Apr 12 '16 at 21:07
  • Sorry, edited for clarification! – bburc Apr 12 '16 at 21:29
  • What does Maven have to do with this? Nothing at all. It's as meaningful as your IDE. – duffymo Apr 13 '16 at 00:53

1 Answer


Essentially a dup of this, but I won't mark it as such... yet.

Basically, large data loads via straight SQL/JDBC are going to be insanely expensive and long-running, unless the existing data constraints are disabled.

Most ETL (extract-transform-load, used for data warehousing) tools avoid straight SQL and work under the covers to do things quickly. Bypassing SQL avoids query parsing (expensive, though bind parameters mitigate it), database constraints (either by disabling them or by using features that allow them to be applied all at once at the end), and intermittent commits (because you could leave your data in a bad state). Some database vendors (MSSQL and Oracle, for example) have tools that do basic loads and can help with bulk situations like that.

IMHO you need to do everything possible to avoid JDBC for large batches. If you don't have access to enterprise-level tools, then use techniques to manipulate the file into a format that pg_restore can use. Even if you need to reformat the data once it's in, in most cases it's easier to get the data loaded into a temp table first and then do whatever other data manipulation/normalization is needed.
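To sketch that staging-table idea (the table and column names here are made up, and the bulk load itself would be done with COPY/\copy or a bulk tool, not per-row inserts):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StagingLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Statement stmt = conn.createStatement()) {

            // Staging table: unlogged, no constraints or indexes, so the
            // bulk load is as cheap as possible.
            stmt.execute("CREATE UNLOGGED TABLE IF NOT EXISTS staging_customers "
                       + "(id text, name text, signup_date text)");

            // ... bulk load into staging_customers happens here (COPY / \copy) ...

            // One set-based statement then does the casting/cleanup into the
            // real (constrained) table -- far cheaper than row-by-row inserts.
            stmt.execute("INSERT INTO customers (id, name, signup_date) "
                       + "SELECT id::bigint, trim(name), signup_date::date "
                       + "FROM staging_customers");
        }
    }
}
```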

Scott Sosna
  • Thanks for your answer. After some thinking and searching I've created a workflow which hopefully is better, although it does not use pg_restore as you suggested. I plan to write an SQL script that will check if a table exists, and if not, create it, and then use COPY to load the CSV data into the newly created table. I will run this from my Java program using Runtime.exec(). (I assume this is faster than doing it through JDBC, but either way I am on a 32 GB machine with 1 TB flash storage.) ...continued next comment... – bburc Apr 13 '16 at 15:25
  • ...continued... Once the data is loaded, I will use JDBC to execute queries on the database (maybe Hibernate, since I'll just be querying and using CRUD operations?). The queries will come from a text field in the web browser from the client, so my next step after this will be the logistics of getting that string and executing it. Thank you for the help; if there is anything major wrong with this please let me know! – bburc Apr 13 '16 at 15:25
  • Using JDBC to query or to do CRUD operations on a manageable number of rows is great after the data is loaded; using JDBC to load it was my concern. – Scott Sosna Apr 13 '16 at 15:44