
Task:

Given this HashMap structure: Map<String, Map<String, String>> mainMap = new HashMap<>()

I want to INSERT or COPY each value of the inner Map into its own cell in my database.

  • The size() of mainMap is 50,000.
  • The size() of the inner Map is 50.
  • The table to be inserted into has 50 columns.
  • Each column's header is the key for the inner Map.

EDIT: The data originates from a large spreadsheet the user uploads, which contains 35 of the 50 columns. I then "cleanse" that data with various formatting, and I add my own 15 new pairs to the inner Map for each mainMap entry. I can't COPY directly from the user's source file to my database without that cleansing/formatting/adding step.

Once I'm done iterating the spreadsheet and building mainMap, that's when I need to insert into my database table efficiently.
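Roughly, the build step looks like this; SpreadsheetRow, cleanseCell() and deriveExtraPairs() are just placeholder names for my own spreadsheet-reading and formatting logic:

    Map<String, Map<String, String>> mainMap = new HashMap<>();
    for (SpreadsheetRow row : uploadedRows) {               // up to 50,000 rows
        Map<String, String> innerMap = new HashMap<>();
        for (String header : row.headers()) {               // the 35 uploaded columns
            innerMap.put(header, cleanseCell(row.get(header)));
        }
        innerMap.putAll(deriveExtraPairs(innerMap));         // my 15 additional pairs -> 50 total
        mainMap.put(row.id(), innerMap);
    }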

Research:

I've read that COPY is the best approach to initially bulk populate a table, however I'm stuck on whether my requirements warrant that command.

This post states that Postgres has a Prepared Statement parameter limit of 34464 for a query.

I'm assuming I need 50 x 50,000 = 2,500,000 parameters in total, which works out to roughly 73 individual queries!

Question:

  • Is COPY the proper approach here instead of all these parameters?
  • If so, do I convert the HashMap values into a .sql file, save it on disk on my web app server, and then reference that in my COPY command, and then delete the temp file? Or can I directly pass a concatenated String into it, without risking SQL injection?

This operation will happen often, hence the need for it to be optimized.

I can't find any examples of converting Java objects into compatible Postgres text file formats, so any feedback helps.

How would you approach this problem?

Additional Info:

My table is pre-existing and can't be deleted since it's the back-end for my webapp and multiple users are connected at any given time.

I understand temporarily removing indexes prior to using COPY can increase performance, but I'm only requiring max 50,000 rows to be inserted or copied at a time, not millions.

StackExchange told me to ask here.

Mathomatic
  • The question is: where do the objects come from? – Pali May 01 '19 at 17:45
  • Okay, can you process the user's data and persist it in the database in a single request? This would eliminate the need for the mainMap and for processing big batches. – Pali May 01 '19 at 17:51
  • Once a user uploads a spreadsheet, I have to iterate row-by-row. So, after I iterate the first row for example, what do I do with that cleansed data prior to processing the next row? I don't see how I can "persist" that single row's data in a database without sending an expensive INSERT query. That's why I'm storing it in RAM, so after all iterations are complete, I bulk INSERT/COPY with minimal queries. What am I not understanding? – Mathomatic May 01 '19 at 17:55
  • if I understood you correctly the mainMap represents a single spreadsheet from a single user and counts 50k rows with 50 columns per row? – Pali May 01 '19 at 17:58
  • No, the user uploads a spreadsheet with max 50,000 rows of data, but only 35 columns. I then need to iterate row-by-row to cleanse each cell of that data in different ways. Based on that row's data, I then add my own additional 15 pairs to the inner `Map`. Hence the target table having 50 columns. `mainMap` by the time it's fully composed is essentially an identical data replication of what the target table should have. – Mathomatic May 01 '19 at 18:01
  • how about creating a csv file and importing it: https://stackoverflow.com/q/2987433/1879409 – Pali May 01 '19 at 18:02
  • Do you think a CSV import would be more efficient than Ancoron's answer below? – Mathomatic May 01 '19 at 18:30
  • @Mathomatic Whether or not a CSV import would be more efficient largely depends on how long it takes to convert that data to CSV in the first place (or if you can get it in a `COPY`-friendly format in the first place). – Bruno May 01 '19 at 18:32

2 Answers


While Java is certainly not the best option for this kind of ETL, it is possible, and with rather little overhead, using standard INSERT statements and prepared queries:

conn.setAutoCommit(false);
PreparedStatement stmt = conn.prepareStatement(
        "INSERT INTO my_table (col_a, col_b, ...)"
        + " VALUES (?, ?, ...)");
int batchSize = 1000;
int rows = 0;
for (Map<String, String> values : mainMap.values()) {
    int i = 0;
    stmt.setString(++i, values.get("col_a"));
    stmt.setString(++i, values.get("col_b"));
    // ...
    stmt.addBatch(); // add the row to the batch
    if (++rows % batchSize == 0) {
        // batch-sizing: execute...
        stmt.executeBatch();
    }
}

if (rows % batchSize != 0) {
    // a last execution if necessary...
    stmt.executeBatch();
}
conn.commit(); // atomic action - if any record fails, the whole import will fail

Alternatively, you could write out the Map into a file and use the CopyManager, but I seriously doubt this would be any faster than with the batched inserts (would be different for millions of rows, though).
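For reference, a rough sketch of that CopyManager route with the PostgreSQL JDBC driver; the table and column names (my_table, col_a, col_b) are placeholders, and proper CSV quoting/escaping is left out:

// Build the CSV in memory and stream it to the server via COPY ... FROM STDIN.
StringBuilder csv = new StringBuilder();
for (Map<String, String> values : mainMap.values()) {
    csv.append(values.get("col_a")).append(',')
       .append(values.get("col_b")) // ... remaining columns
       .append('\n');
}
CopyManager copyManager = conn.unwrap(org.postgresql.PGConnection.class).getCopyAPI();
long rowsCopied = copyManager.copyIn(
        "COPY my_table (col_a, col_b) FROM STDIN WITH (FORMAT csv)",
        new java.io.StringReader(csv.toString()));

Note that CopyManager reads from any Reader or InputStream over the JDBC connection, so no temporary file on the server (or on disk at all) is strictly required.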

Ancoron
  • This looks promising. I'll report back. Thanks – Mathomatic May 01 '19 at 18:31
  • I feel compelled to accept your answer as it was first, correct, cohesive and also provided interesting info about `conn.commit()` I didn't know. Thanks for the support. +1 – Mathomatic May 01 '19 at 19:02

COPY may indeed be the recommended way for initial bulk uploads, but there are limitations considering your initial data is stored in memory in a Java Map:

  • Firstly, it expects to load from a file (local to the server, and readable by its user), or a program (again, executed locally on the server), or via STDIN. None of these options are particularly friendly to a JDBC connection.
  • Secondly, even if you could prepare the data in that format (assuming you're on the same machine to prepare such a file, for example), you'd still need to convert the data held in memory in Java into the format COPY expects. That processing probably wouldn't make it worth using COPY.

I would instead create a PreparedStatement to insert your 50 columns, and then iterate through to execute that prepared statement for each Map in mainMap.values() (i.e. 50 columns each time).

You can gain speed by using executeBatch(). That said, I wouldn't execute all 50,000 rows in one batch, but in sub-batches.

I'd do something like this:

    int BATCH_SIZE = 100;

    // The key names must be listed in the same order as the columns in the
    // INSERT statement they map to.
    List<String> keyNames = new ArrayList<>(); // populate with your 50 column keys

    int i = 0;
    try (PreparedStatement ps = conn
            .prepareStatement("INSERT INTO xyz (col1, col2, ...) VALUES (?, ?, ...)")) {
        for (Map<String, String> rowMap : mainMap.values()) {
            int j = 1;
            for (String key : keyNames) {
                ps.setString(j, rowMap.get(key));
                j++;
            }
            ps.addBatch();
            i++;

            if (i % BATCH_SIZE == 0) {
                ps.executeBatch();
            }
        }
        if (i % BATCH_SIZE != 0) {
            // Execute whatever was added since the last full batch
            // (this also covers the case of fewer than BATCH_SIZE rows in total).
            ps.executeBatch();
        }
    }
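For completeness, keyNames just needs to hold the inner-map keys in the same order as the columns listed in the INSERT statement; with the placeholder column names above that would be something like:

    List<String> keyNames = Arrays.asList("col1", "col2", /* ..., */ "col50");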
Bruno
  • Thanks Bruno, I'm spending some time now to really understand both of your answers before replying further. – Mathomatic May 01 '19 at 18:31
  • I had to accept the other as it was first and also correct, but yours is very similar so thank you. You validated my concern about using `COPY` here and taught me some interesting tidbits. +1 – Mathomatic May 01 '19 at 19:05