
I am trying to speed up inserts into a schema. I have seen some questions about this already, but they do not deal with my issue; these are the most relevant:

The tables within my schema all use auto-generated serial keys, so I am using RETURNING id at the end of the inserts. The reason I am not using COPY is that I need to check for constraints. Specifically, some fields have unique indices, so I check whether a given row already exists in order to select its id instead (for the sake of mapping onto the auto-generated keys).

I do this:

sql = "INSERT INTO %s.%s(%s) VALUES (%s) ON CONFLICT DO NOTHING RETURNING %s;"
sql = sql % (schema_name,table_name,','.join(head[1:]),'%s',head[0])

where head contains the names of the fields in the table. Then I do this:

    try:
        # get the auto generated id back from the insert just executed
        return_id = pdb.cur.fetchone()[0]
    except TypeError:
        # fetchone() returned None (the insert was skipped by ON CONFLICT),
        # so look up the id of the row that already exists
        sql_exist_id = "SELECT %s FROM %s.%s WHERE %s = '%s'" % (head[0],schema_name,table_name,unique_field,row[1])

My question is: is there a way to do a batch insert (like this: psycopg2: insert multiple rows with one query) while still checking for constraints and returning the keys?

AnarKi
  • You should not be using string formatting to create queries; this is vulnerable to SQL injection. Instead, use parameterized queries. – roganjosh May 09 '18 at 12:45
  • Hmm, maybe bulk insert to a staging table, left join to check for dupe keys, then insert into the real table? I am fairly certain the answer to "can you return all keys from a bulk operation" is "no". Another option may be to insert a secondary/external key; that way you have a set of keys before inserting and after. – Jacob H May 09 '18 at 13:06
  • @JacobH "can you return all keys from a bulk operation" was my main interest... anyway, thanks for the suggestion. Another way I thought about was to check whether any of the to-be-inserted rows already exist on the unique index, save their positions in the row list, insert all rows while ignoring the ones that exist (ON CONFLICT DO NOTHING), and then use the ids from that earlier select (a rough sketch of this follows after these comments). – AnarKi May 09 '18 at 13:10
  • @AnarKi I am more of a SQL Server person so it's possible that it could be different in postgres. Maybe this helps? https://stackoverflow.com/questions/5875953/returning-multiple-serial-values-from-posgtres-batch-insert – Jacob H May 09 '18 at 13:13
  • There's limited support for the type of thing you're describing... http://sqlfiddle.com/#!17/600ca/10 – MatBailie May 09 '18 at 13:13
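
For illustration, here is a rough sketch of the approach AnarKi describes in the comments above: select the ids of the rows that already exist on the unique index, insert everything else in one statement with ON CONFLICT DO NOTHING, and combine the two sets of ids. It assumes an open psycopg2 connection conn and placeholder names (some_table with a serial id, a unique column col1 and a plain column col2); execute_values(..., fetch=True) needs psycopg2 2.8 or newer.

from psycopg2.extras import execute_values

rows = [('a', 1), ('b', 2), ('c', 3)]   # (col1, col2) pairs to be loaded

with conn.cursor() as cur:
    # 1. find which unique values already exist and map them to their ids
    cur.execute("SELECT col1, id FROM some_table WHERE col1 = ANY(%s)",
                ([r[0] for r in rows],))
    ids = dict(cur.fetchall())

    # 2. insert everything in one statement, skipping rows that already exist;
    #    returning col1 alongside id keeps the row -> id mapping explicit
    returned = execute_values(
        cur,
        "INSERT INTO some_table (col1, col2) VALUES %s "
        "ON CONFLICT DO NOTHING RETURNING col1, id",
        rows, fetch=True)
    ids.update(dict(returned))
    # ids now maps every unique value to its id, whether pre-existing or new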

1 Answer


You can get pretty close...

The RETURNING clause will work on multi-value inserts, like so:

insert into some_table (col1, col2) values 
    (val1, val2), (val3, val4), 
    (val5, val6), (val7, val8) 
    returning id 

Assuming some_table has a serial column, you will get 4 ids back from that statement. It will also work with insert into ... select:

insert into some_table (col1, col2) 
    select col1, col2 
      from some_other_source returning id
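
In psycopg2 terms, a minimal sketch of running such a multi-values insert and collecting the returned ids (assuming an open connection conn and the same placeholder table):

with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO some_table (col1, col2) "
        "VALUES (%s, %s), (%s, %s) RETURNING id",
        ('a', 1, 'b', 2))
    new_ids = [r[0] for r in cur.fetchall()]   # one id per inserted row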

With either form of insert, though, there is a catch when conflicts can occur: without an on conflict clause, one conflicting record makes the entire statement fail and aborts the transaction, and with on conflict do nothing the conflicting rows are silently skipped and do not appear in the returning results, so you lose the id mapping for exactly those rows.

But note that returning also works on deletes:

delete from some_table where col1 = 'bob' returning id

Putting it together with the suggestion from @JacobH: bulk load into a temp table, filter out the conflicting rows, then insert the remaining, non-conflicting rows. You mentioned copy, so I'm assuming there is already a .csv that roughly matches the destination. Something like this:

with conn.cursor() as stmt:
    #create a constraint-free temp table based on your destination table
    #the `where 0 = 1` ensures it's empty
    stmt.execute('select * into temp tmp_some_table from some_table where 0 = 1') 

    #bulk load into the temp table:
    with open('some.csv') as infile:
        stmt.copy_from(infile, 'tmp_some_table', sep=',', columns=['col1', 'col2'])

    #strip out your conflicts. note that we are returning col1 (assumed to be a unique key)
    #you can do as many passes as needed...
    stmt.execute('delete from tmp_some_table tst using some_table st where tst.col1 = st.col1 returning tst.col1')
    dupe_keys = stmt.fetchall() #all your deleted, conflicting keys

    #bulk insert non-conflicting values
    stmt.execute('insert into some_table (col1, col2) select col1, col2 from tmp_some_table returning id')
    inserted_ids = stmt.fetchall() #all the ids created from the insert
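
#unless the connection is in autocommit mode, commit to make the work permanent
conn.commit()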

A bonus is that this should be very fast compared to iterating and calling execute: in essence you are doing everything in 4 statements, and the bulk of the network I/O is the initial csv import rather than N round trips for N single-record inserts. Everything else happens in the database. The temp table will probably stay in memory if it's reasonably small.

I realize I'm late to the party with this answer, but I hope it helps.

bimsapi