
A very frequently asked question here is how to do an upsert, which is what MySQL calls INSERT ... ON DUPLICATE KEY UPDATE and which the SQL standard supports as part of the MERGE operation.

Given that PostgreSQL doesn't support it directly (before pg 9.5), how do you do this? Consider the following:

CREATE TABLE testtable (
    id integer PRIMARY KEY,
    somedata text NOT NULL
);

INSERT INTO testtable (id, somedata) VALUES
(1, 'fred'),
(2, 'bob');

Now imagine that you want to "upsert" the tuples (2, 'Joe'), (3, 'Alan'), so the new table contents would be:

(1, 'fred'),
(2, 'Joe'),    -- Changed value of existing tuple
(3, 'Alan')    -- Added new tuple

That's what people are talking about when discussing an upsert. Crucially, any approach must be safe in the presence of multiple transactions working on the same table - either by using explicit locking, or otherwise defending against the resulting race conditions.

This topic is discussed extensively at Insert, on duplicate update in PostgreSQL?, but that's about alternatives to the MySQL syntax, and it's grown a fair bit of unrelated detail over time. I'm working on definitive answers.

These techniques are also useful for "insert if not exists, otherwise do nothing", i.e. "insert ... on duplicate key ignore".

Craig Ringer
    possible duplicate of [Insert, on duplicate update in PostgreSQL?](http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql) – Michael Hampton Aug 11 '14 at 04:01
    @MichaelHampton the goal here was to create a definitive version that's not confused by multiple outdated answers - and locked, so nobody can do anything about it. I disagree with the closevote. – Craig Ringer Aug 11 '14 at 08:10
  • Why, then this would soon become outdated - and locked, so nobody could do anything about it. – Michael Hampton Aug 11 '14 at 10:42
    @MichaelHampton If you're concerned, perhaps you could flag the one you linked to and ask for it to be unlocked so it can be cleaned up, then we can merge this in. I'm just sick of having the only obvious close-as-dup for upsert being such a confusing and wrong mess. – Craig Ringer Aug 11 '14 at 11:20
    That Q&A is not locked! – Michael Hampton Aug 11 '14 at 11:21
  • I found this more useful - https://stackoverflow.com/a/36799500/80428 – Jay Wick May 31 '17 at 14:13

6 Answers


9.5 and newer:

PostgreSQL 9.5 and newer support INSERT ... ON CONFLICT (key) DO UPDATE (and ON CONFLICT (key) DO NOTHING), i.e. upsert.

Comparison with ON DUPLICATE KEY UPDATE.

Quick explanation.

For usage see the manual - specifically the conflict_action clause in the syntax diagram, and the explanatory text.

Unlike the solutions for 9.4 and older that are given below, this feature works with multiple conflicting rows and it doesn't require exclusive locking or a retry loop.
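For instance, a minimal sketch against the question's testtable (illustrative only; see the manual for the full syntax and options):

INSERT INTO testtable (id, somedata) VALUES
    (2, 'Joe'),
    (3, 'Alan')
ON CONFLICT (id) DO UPDATE
SET somedata = EXCLUDED.somedata;

-- or, for "insert if not exists, otherwise do nothing":
INSERT INTO testtable (id, somedata) VALUES (2, 'Joe')
ON CONFLICT (id) DO NOTHING;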

The commit adding the feature is here and the discussion around its development is here.


If you're on 9.5 or newer and don't need to be backward-compatible, you can stop reading now.


9.4 and older:

PostgreSQL doesn't have any built-in UPSERT (or MERGE) facility, and doing it efficiently in the face of concurrent use is very difficult.

This article discusses the problem in useful detail.

In general you must choose between two options:

  • Individual insert/update operations in a retry loop; or
  • Locking the table and doing a batch merge

Individual row retry loop

Using individual row upserts in a retry loop is the reasonable option if you want many connections concurrently trying to perform inserts.

The PostgreSQL documentation contains a useful procedure that'll let you do this in a loop inside the database. Unlike most naive solutions, it guards against lost updates and insert races. However, it only works in READ COMMITTED mode and is only safe if it's the only thing you do in the transaction. The function won't work correctly if triggers or secondary unique keys cause unique violations.
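As a rough sketch, the documented approach looks like this when adapted to the question's testtable (the function name is illustrative, and this assumes id is the only unique key and that the call runs in READ COMMITTED):

CREATE FUNCTION upsert_testtable(_id integer, _somedata text) RETURNS void AS
$$
BEGIN
    LOOP
        -- First try to update an existing row.
        UPDATE testtable SET somedata = _somedata WHERE id = _id;
        IF found THEN
            RETURN;
        END IF;
        -- No row was updated, so try to insert. If another transaction
        -- inserts the same key first, we get a unique_violation and loop
        -- back to retry the UPDATE.
        BEGIN
            INSERT INTO testtable (id, somedata) VALUES (_id, _somedata);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- Do nothing; fall through to the next loop iteration.
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;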

This strategy is very inefficient. Whenever practical you should queue up work and do a bulk upsert as described below instead.

Many attempted solutions to this problem fail to consider rollbacks, so they result in incomplete updates. Two transactions race with each other; one of them successfully INSERTs; the other gets a duplicate key error and does an UPDATE instead. The UPDATE blocks waiting for the INSERT to roll back or commit. When it rolls back, the UPDATE condition re-check matches zero rows, so even though the UPDATE commits it hasn't actually done the upsert you expected. You have to check the result row counts and re-try where necessary.

Some attempted solutions also fail to consider SELECT races. If you try the obvious and simple:

-- THIS IS WRONG. DO NOT COPY IT. It's an EXAMPLE.

BEGIN;

UPDATE testtable
SET somedata = 'blah'
WHERE id = 2;

-- Remember, this is WRONG. Do NOT COPY IT.

INSERT INTO testtable (id, somedata)
SELECT 2, 'blah'
WHERE NOT EXISTS (SELECT 1 FROM testtable WHERE testtable.id = 2);

COMMIT;

then when two run at once there are several failure modes. One is the already discussed issue with an update re-check. Another is when both UPDATE at the same time, match zero rows, and continue. Then they both do the EXISTS test, which happens before the INSERT. Both get zero rows, so both do the INSERT. One fails with a duplicate key error.

This is why you need a re-try loop. You might think that you can prevent duplicate key errors or lost updates with clever SQL, but you can't. You need to check row counts or handle duplicate key errors (depending on the chosen approach) and re-try.

Please don't roll your own solution for this. Like with message queuing, it's probably wrong.

Bulk upsert with lock

Sometimes you want to do a bulk upsert, where you have a new data set that you want to merge into an older existing data set. This is vastly more efficient than individual row upserts and should be preferred whenever practical.

In this case, the process typically looks like this:

  • CREATE a TEMPORARY table

  • COPY or bulk-insert the new data into the temp table

  • LOCK the target table IN EXCLUSIVE MODE. This permits other transactions to SELECT, but not to make any changes to the table.

  • Do an UPDATE ... FROM of existing records using the values in the temp table;

  • Do an INSERT of rows that don't already exist in the target table;

  • COMMIT, releasing the lock.

For example, to handle the data given in the question, using a multi-valued INSERT to populate the temp table:

BEGIN;

CREATE TEMPORARY TABLE newvals(id integer, somedata text);

INSERT INTO newvals(id, somedata) VALUES (2, 'Joe'), (3, 'Alan');

LOCK TABLE testtable IN EXCLUSIVE MODE;

UPDATE testtable
SET somedata = newvals.somedata
FROM newvals
WHERE newvals.id = testtable.id;

INSERT INTO testtable
SELECT newvals.id, newvals.somedata
FROM newvals
LEFT OUTER JOIN testtable ON (testtable.id = newvals.id)
WHERE testtable.id IS NULL;

COMMIT;

Related reading

What about MERGE?

SQL-standard MERGE actually has poorly defined concurrency semantics and is not suitable for upserting without locking a table first.

It's a really useful OLAP statement for data merging, but it's not actually a useful solution for concurrency-safe upsert. There's lots of advice to people using other DBMSes to use MERGE for upserts, but it's actually wrong.

Other DBs:

Craig Ringer
  • In the bulk upsert, is there possible value in deleting from newvals rather than filtering the INSERT? E.g. WITH upd AS (UPDATE ... RETURNING newvals.id) DELETE FROM newvals USING upd WHERE newvals.id = upd.id, followed by a bare INSERT INTO testtable SELECT * FROM newvals? My idea with this: instead of filtering twice in INSERT (for the JOIN/WHERE and for the unique constraint), reuse the existence check results from the UPDATE, which are in RAM already, and may be much smaller. This may be a win if few rows matched and/or newvals is much smaller than testtable. – Gunnlaugur Briem Jul 31 '14 at 18:59
    There are still unresolved issues and for the other vendors it is not clear what works and what does not. 1. The Postgres looping solution as noted does not work in the case of multiple unique keys. 2. The on duplicate key for mysql also does not work for multiple unique keys. 3. Do the other solutions for MySQL, SQL Server and Oracle posted above work? Are exceptions possible in those cases and do we have to loop? – dan b Jan 27 '15 at 21:02
  • @danb This is only really about PostgreSQL. There is no cross-vendor solution. The solution for PostgreSQL does not work for multiple rows, you must do one transaction per row unfortunately. The "solutions" that use `MERGE` for SQL Server and Oracle are incorrect and prone to race conditions, as noted above. You'll need to look into each DBMS specifically to find out how to handle them, I can really only offer advice on PostgreSQL. The only way to do a safe multi-row upsert on PostgreSQL will be if support for native upsert gets added to the core server. – Craig Ringer Jan 31 '15 at 22:53
  • Even for PostgreSQL the solution does not work in the case where a table has multiple unique keys (updating only one row). In that case you need to specify which key is being updated. There may be a cross-vendor solution using JDBC, for example. – dan b Feb 01 '15 at 07:07
  • @danb As there is no upsert specified in the SQL standard (and no, `MERGE` does not count, it's an OLAP operation that doesn't consider concurrency) there's really not a simple portable way to do it. Nor does JDBC define anything at the Java API layer - read the spec and api doc, you'll see nothing like what you are looking for. Until the SQL committee get around to defining an upsert spec (with no doubt awful syntax and weird behavioural quirks) you'll have to use vendor-specific approaches. – Craig Ringer Feb 01 '15 at 10:44
  • It cannot be done as a stored procedure, however, you could potentially implement a method that, for example, implements upsert by inserting the values, and doing an update if you catch a duplicate key exception. More care is needed to set the Transaction Isolation levels and retry in the case of deadlock. – dan b Feb 01 '15 at 12:22
  • Quick question… [this blog post](http://tapoueh.org/blog/2013/03/15-batch-update.html) suggests a `SHARE ROW EXCLUSIVE` lock instead, and it appears as though that weaker lock would work just as well. Is there a specific reason you're using `EXCLUSIVE` here? – jasonmp85 Mar 16 '15 at 22:45
  • @jasonmp85 [Docs](http://www.postgresql.org/docs/current/static/explicit-locking.html). `SHARE ROW EXCLUSIVE` permits `ROW SHARE`, which `EXCLUSIVE` does not. Since `ROW SHARE` is only acquired by `SELECT ... FOR UPDATE` and `SELECT ... FOR SHARE` it won't cause data errors, but it might permit deadlocks due to row-lock-ordering differences. What's wrong with `EXCLUSIVE`? It still permits `SELECT`, and `SHARE ROW EXCLUSIVE` doesn't gain you much over that. – Craig Ringer Mar 17 '15 at 09:57
  • Nothing's wrong with `EXCLUSIVE`, I was only curious about the discrepancy and couldn't quite make out what the effect was of using the weaker lock. You've clarified it nicely. Thanks! – jasonmp85 Mar 17 '15 at 17:35
  • Thanks for the clarification on concurrent inserts.. deleted my answer. – Matt Bannert Mar 28 '15 at 12:08
    Postgres now supports UPSERT - http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=168d5805e4c08bed7b95d351bf097cff7c07dd65 – Chris May 08 '15 at 06:45
  • @Chris Or it will in 9.5, and it's not spelled `UPSERT`. But yeah. Huge probs to Peter for getting this going. – Craig Ringer May 08 '15 at 06:47
  • Is it acceptable to populate `TEMPORARY TABLE newvals` outside of transaction? I have a system which requires to upsert a large amount of data over a long period of time. So I can either buffer the data in memory and then begin transaction, upsert, commit - or insert the data into temporary table once it hits my side. So, is it OK to create temp table outside of transaction? – Denis Gorbachev Jul 12 '15 at 14:46
  • @DenisGorbachev Yes, but I would probably use an unlogged table so you don't lose the buffered data if your session disconnects. Please post a new question rather than commenting here. – Craig Ringer Jul 12 '15 at 22:25
  • I found some more help on how to perform the update with the FROM with the answer posted at http://stackoverflow.com/questions/3845718/sql-how-to-update-table-values-from-another-table-with-the-same-user-name – zerocog Oct 18 '16 at 21:58
  • @CraigRinger, @jasonmp85: one reason that `SHARE ROW EXCLUSIVE` might be preferable is that it allows `VACUUM` operations to proceed. Sooner or later you're going to need a `VACUUM`, and you don't really want all your updates to get blocked behind it. – richvdh Jul 26 '18 at 10:22
  • Is it possible to get a count of successful `UPSERT`ed rows? Probably relevant to: https://stackoverflow.com/questions/32847267/how-to-get-affected-row-count-of-table-inside-wcte-based-upsert-in-postgresql. – shellcat_zero Sep 27 '18 at 09:34
  • The update queue for this question is full, but it's important to note that `ON CONFLICT UPDATE` is actually `ON CONFLICT (key) DO UPDATE`. – dimiguel Jun 28 '20 at 16:26
    Examples are golden - so why aren't there any? Actual working example, not three dots something, syntactically correct example of an actual UPSERT. Can we have that please? edit - found some examples here https://kb.objectrocket.com/postgresql/how-to-perform-the-postgresql-upsert-in-existing-table-1260 would be nice if SO could be last stop when googling this. Edit #2 found some examples in answers below, shame it's not in accepted one. – Petr Dec 28 '20 at 15:11
  • @Petr I linked to the user manual. And I wrote this before the feature you're talking about even existed. Want it changed? Propose an edit and add the examples you want yourself! – Craig Ringer Jan 12 '21 at 09:22

Here are some examples for insert ... on conflict ... (pg 9.5+):

  • Insert, on conflict - do nothing.
    insert into dummy(id, name, size) values(1, 'new_name', 3)
    on conflict do nothing;
    
  • Insert, on conflict - do update, specify conflict target via column.
    insert into dummy(id, name, size) values(1, 'new_name', 3)
    on conflict(id)
    do update set name = 'new_name', size = 3;  
    
  • Insert, on conflict - do update, specify conflict target via constraint name.
    insert into dummy(id, name, size) values(1, 'new_name', 3)
    on conflict on constraint dummy_pkey
    do update set name = 'new_name', size = 4;
    
user218867
  • great answer - question: why or in what situation should one use target specification via column or constraint name? Is there an advantage/disadvantage for various use-cases? – Nathan Benton Apr 27 '20 at 16:36
    @NathanBenton I think there are at least 2 differences: (1) column name is specified by programmer, while constraint name might be either specified by programmer, or generated by database according to table / column names. (2) each column might have multiple constraints. That said, it depends on your case to choose which one to use. – user218867 Apr 27 '20 at 17:06
  • simple, easy to understand, kudo! – Jeb50 Feb 05 '21 at 19:45

I am trying to contribute another solution for the single-insertion problem with the pre-9.5 versions of PostgreSQL. The idea is simply to try to perform the insertion first and, in case the record is already present, to update it:

do $$
begin 
  insert into testtable(id, somedata) values(2,'Joe');
exception when unique_violation then
  update testtable set somedata = 'Joe' where id = 2;
end $$;

Note that this solution can be applied only if rows of the table are never deleted.

I do not know about the efficiency of this solution, but it seems to me reasonable enough.

Renzo
    Thank you, that's exactly what I was looking for. Can't understand why it was so hard to find. – isapir Dec 29 '15 at 02:17
    Yep. This simplification works if and only if there are no deletes. – Craig Ringer Jan 18 '16 at 11:24
  • @CraigRinger Can you explain what exactly will happen if there was deletes? – turbanoff Feb 03 '19 at 14:03
  • @turbanoff The insert can fail because the record is already there, then it's deleted concurrently, and the update then affects zero rows because the row was deleted. – Craig Ringer Feb 04 '19 at 04:51
  • @CraigRinger So *deleting happens concurrently*. What are the possible outcomes if this otherwise works fine? If the delete runs concurrently, it can just as well be executed right after our block. What I'm trying to say is: if we have a concurrent delete, then this code works in the *same* manner as a proper `insert on update`. – turbanoff Feb 04 '19 at 12:28
  • @turbanoff I cannot understand what you said. Please post a new question if you want something in depth. – Craig Ringer Feb 04 '19 at 13:10

SQLAlchemy upsert for Postgres >=9.5

Since the large post above covers many different SQL approaches for various Postgres versions (not just pre-9.5, as in the question), I would like to add how to do it in SQLAlchemy if you are using Postgres 9.5. Instead of implementing your own upsert, you can also use SQLAlchemy's functions (which were added in SQLAlchemy 1.1). Personally, I would recommend using these if possible, not only for convenience, but also because it lets PostgreSQL handle any race conditions that might occur.

Cross-posting from another answer I gave yesterday (https://stackoverflow.com/a/44395983/2156909)

SQLAlchemy now supports ON CONFLICT with two methods, on_conflict_do_update() and on_conflict_do_nothing():

Copying from the documentation:

from sqlalchemy.dialects.postgresql import insert

stmt = insert(my_table).values(user_email='a@b.com', data='inserted data')
stmt = stmt.on_conflict_do_update(
    index_elements=[my_table.c.user_email],
    index_where=my_table.c.user_email.like('%@gmail.com'),
    set_=dict(data=stmt.excluded.data)
    )
conn.execute(stmt)

http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=conflict#insert-on-conflict-upsert

P.R.
    Python and SQLAlchemy are not mentioned in the question. – Alexander Emelianov Nov 01 '17 at 12:04
  • I often use Python in the solutions I write. But I haven't looked into SQLAlchemy (or was aware of it). This seems an elegant option. Thank you. If it checks out, I'll present this to my organization. – Robert Aug 07 '19 at 13:34
WITH UPD AS (UPDATE TEST_TABLE SET SOME_DATA = 'Joe' WHERE ID = 2 
RETURNING ID),
INS AS (SELECT '2', 'Joe' WHERE NOT EXISTS (SELECT * FROM UPD))
INSERT INTO TEST_TABLE(ID, SOME_DATA) SELECT * FROM INS

Tested on PostgreSQL 9.3

aristar
  • @CraigRinger : could you elaborate on this? isn't the cte atomic ? – parisni Nov 17 '19 at 14:59
    @parisni No. Each CTE term gets its own snapshot if it performs writes. Also there's no sort of predicate locking performed on rows that were *not* found so they can still be created concurrently by another session. If you used `SERIALIZABLE` isolation you'd get an abort with a serialization failure, otherwise you'd probably get a unique violation. Don't reinvent upsert, the reinvention will be wrong. Use `INSERT ... ON CONFLICT ...`. If your PostgreSQL is too old, update it. – Craig Ringer Dec 06 '19 at 04:20
  • @CraigRinger `INSERT ... ON CONFLICT ...` is not intended for bulk loading. From your post, the `LOCK TABLE testtable IN EXCLUSIVE MODE;` within a CTE is a workaround to get atomic things. No? – parisni Dec 06 '19 at 13:22
  • @parisni It's not intended for bulk loading? Says who? https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT . Sure, it's much slower than bulk loading without upsert-like behaviour, but that's obvious and will be the case no matter what you do. It's way faster than using subtransactions, that's for sure. The fastest approach is to *lock the target table* then do an `insert ... where not exists ...` or similar, of course. – Craig Ringer Dec 11 '19 at 03:09
  • ```WITH upsert AS ( UPDATE tbl SET foo = 42 RETURNING * ) INSERT INTO tbl(foo) SELECT 42 WHERE NOT EXISTS (SELECT * FROM upsert);``` - this working for me – GLeBaTi May 11 '21 at 12:57

Since this question was closed, I'm posting here how you do it using SQLAlchemy. Via recursion, it retries a bulk insert or update to combat race conditions and validation errors.

First the imports

import itertools as it

from functools import partial
from operator import itemgetter

from sqlalchemy.exc import IntegrityError
from app import session
from models import Posts

Now a couple of helper functions

def chunk(content, chunksize=None):
    """Groups data into chunks each with (at most) `chunksize` items.
    https://stackoverflow.com/a/22919323/408556
    """
    if chunksize:
        i = iter(content)
        generator = (list(it.islice(i, chunksize)) for _ in it.count())
    else:
        generator = iter([content])

    return it.takewhile(bool, generator)


def gen_resources(records):
    """Yields a dictionary if the record's id already exists, a row object 
    otherwise.
    """
    ids = {item[0] for item in session.query(Posts.id)}

    for record in records:
        is_row = hasattr(record, 'to_dict')

        if is_row and record.id in ids:
            # It's a row but the id already exists, so we need to convert it 
            # to a dict that updates the existing record. Since it is duplicate,
            # also yield True
            yield record.to_dict(), True
        elif is_row:
            # It's a row and the id doesn't exist, so no conversion needed. 
            # Since it's not a duplicate, also yield False
            yield record, False
        elif record['id'] in ids:
            # It's a dict and the id already exists, so no conversion needed. 
            # Since it is duplicate, also yield True
            yield record, True
        else:
            # It's a dict and the id doesn't exist, so we need to convert it. 
            # Since it's not a duplicate, also yield False
            yield Posts(**record), False

And finally the upsert function

def upsert(data, chunksize=None):
    for records in chunk(data, chunksize):
        resources = gen_resources(records)
        sorted_resources = sorted(resources, key=itemgetter(1))

        for dupe, group in it.groupby(sorted_resources, itemgetter(1)):
            items = [g[0] for g in group]

            if dupe:
                _upsert = partial(session.bulk_update_mappings, Posts)
            else:
                _upsert = session.add_all

            try:
                _upsert(items)
                session.commit()
            except IntegrityError:
                # A record was added or deleted after we checked, so retry
                # 
                # modify accordingly by adding additional exceptions, e.g.,
                # except (IntegrityError, ValidationError, ValueError)
                # Use the imported session (the original referenced db.session here)
                session.rollback()
                upsert(items)
            except Exception as e:
                # Some other error occurred so reduce chunksize to isolate the 
                # offending row(s)
                session.rollback()
                num_items = len(items)

                if num_items > 1:
                    upsert(items, num_items // 2)
                else:
                    print('Error adding record {}'.format(items[0]))

Here's how you use it

>>> data = [
...     {'id': 1, 'text': 'updated post1'}, 
...     {'id': 5, 'text': 'updated post5'}, 
...     {'id': 1000, 'text': 'new post1000'}]
... 
>>> upsert(data)

The advantage this has over bulk_save_objects is that it can handle relationships, error checking, etc. on insert (unlike bulk operations).

reubano
  • It also looks wrong to me. What if a concurrent session inserts a row after you collect your list of IDs? Or deletes one? – Craig Ringer Apr 27 '17 at 02:01
  • good point @CraigRinger I do something similar to this but only have 1 session performing the job. What's the best way to handle multiple sessions then? A transaction perhaps? – reubano Apr 27 '17 at 07:37
  • Transactions aren't the magic solution to all concurrency issues. You could use `SERIALIZABLE` transactions and handle serialization failures but it's slow. You need error handling and a retry loop. See my answer and the "related reading" section in it. – Craig Ringer Apr 27 '17 at 10:45
  • @CraigRinger gotcha. I actually implemented a retry loop in my own case due other validation failures. I'll update this answer accordingly. – reubano Apr 27 '17 at 14:16