
I've got this table (generated by Django):

CREATE TABLE feeds_person (
  id serial PRIMARY KEY,
  created timestamp with time zone NOT NULL,
  modified timestamp with time zone NOT NULL,
  name character varying(4000) NOT NULL,
  url character varying(1000) NOT NULL,
  email character varying(254) NOT NULL,
  CONSTRAINT feeds_person_name_ad8c7469_uniq UNIQUE (name, url, email)
);

I'm trying to bulk insert a lot of data using INSERT with an ON CONFLICT clause.

The wrinkle is that I need to get the id back for all of the rows, whether they already exist or were just inserted.

In other cases, I would do something like:

INSERT INTO feeds_person (created, modified, name, url, email)
VALUES blah blah blah
ON CONFLICT (name, url, email) DO UPDATE SET url = feeds_person.url
RETURNING id

Doing the UPDATE causes the statement to return the id of that row. Except it doesn't work with this table. I think it fails because I have multiple fields unique together, whereas in the other cases where I've used this method there was just one unique field.

I get this error when trying to run the SQL through Django's cursor:

django.db.utils.ProgrammingError: ON CONFLICT DO UPDATE command cannot affect row a second time
HINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values.

How do I do the bulk insert with this table and get back the inserted and existing ids?

  • `I need to get the id back for all of the rows`. Obviously, you have duplicates on `(name, url, email)` in your `VALUES` list. Do you need to preserve those dupes or can those be folded? If yes, which peer to pick from sets of dupes? And do you have to deal with concurrent write access? – Erwin Brandstetter Mar 12 '16 at 03:36
  • @ErwinBrandstetter I'm dumping tens of thousands of records a minute from external data sources that I request data from periodically. Each time I request data from external source it returns some set of data that I already have along with new data. I don't need to update the data that I already have. I don't know why it didn't click with me that that error was indicating the duplicates in my VALUES. Duplicates in the VALUES can be folded into just one record. – Dustin Wyatt Mar 12 '16 at 18:25

1 Answer


The error you get:

ON CONFLICT DO UPDATE command cannot affect row a second time

indicates you are trying to upsert the same row more than once in a single command. In other words: you have dupes on (name, url, email) in your VALUES list. Fold duplicates (if that's an option) and it should work. But you will have to decide which row to pick from each set of dupes.

INSERT INTO feeds_person (created, modified, name, url, email)
SELECT DISTINCT ON (name, url, email) *
FROM  (
   VALUES
   ('blah', 'blah', 'blah', 'blah', 'blah')
   -- ... more
   ) v(created, modified, name, url, email)  -- match column list
ON     CONFLICT (name, url, email) DO UPDATE
SET    url = feeds_person.url
RETURNING id;

Since we use a free-standing VALUES expression now, you have to add explicit type casts for non-default types. Like:

VALUES
    (timestamptz '2016-03-12 02:47:56+01'
   , timestamptz '2016-03-12 02:47:56+01'
   , 'n3', 'u3', 'e3')
   ...

Your timestamptz columns need an explicit type cast, while the string literals default to type text, which works here. (You could still cast to varchar(n) right away.)
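For example, keeping the explicit timestamptz casts and (optionally) casting the string literals to the column types right away - the values here are placeholders, and the varchar lengths match the table definition:

VALUES
    (timestamptz '2016-03-12 02:47:56+01'
   , timestamptz '2016-03-12 02:47:56+01'
   , 'n3'::varchar(4000), 'u3'::varchar(1000), 'e3'::varchar(254))
   ...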

There are ways to determine which row to pick from each set of dupes: DISTINCT ON keeps the first row per group in sort order, so add ORDER BY expressions after the DISTINCT ON columns to control which peer wins.
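A minimal sketch of the idea, using throwaway values with a duplicate key and sorting by modified DESC (an arbitrary choice, for illustration only):

SELECT DISTINCT ON (name, url, email) *
FROM  (
   VALUES
   (timestamptz '2016-03-12 02:47:56+01', timestamptz '2016-03-12 02:47:56+01', 'n3', 'u3', 'e3')
 , (timestamptz '2016-03-13 10:00:00+01', timestamptz '2016-03-13 10:00:00+01', 'n3', 'u3', 'e3')  -- dupe key
   ) v(created, modified, name, url, email)
ORDER  BY name, url, email, modified DESC;  -- keeps the most recently modified peer

The same ORDER BY can be added to the subquery feeding the INSERT above.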

You are right, there is (currently) no way to get excluded rows in the RETURNING clause. I quote the Postgres Wiki:

Note that RETURNING does not make visible the "EXCLUDED.*" alias from the UPDATE (just the generic "TARGET.*" alias is visible there). Doing so is thought to create annoying ambiguity for the simple, common cases [30] for little to no benefit. At some point in the future, we may pursue a way of exposing if RETURNING-projected tuples were inserted and updated, but this probably doesn't need to make it into the first committed iteration of the feature [31].

However, you shouldn't be updating rows that are not supposed to be updated. Empty updates are almost as expensive as regular updates - and might have unintended side effects. You don't strictly need UPSERT to begin with; your case looks more like "SELECT or INSERT".

One cleaner way to insert a set of rows would be with data-modifying CTEs:

WITH val AS (
   SELECT DISTINCT ON (name, url, email) *
   FROM  (
      VALUES 
      (timestamptz '2016-01-01 00:00+1', timestamptz '2016-01-01 00:00+1', 'n', 'u', 'e')
    , ('2016-03-12 02:47:56+01', '2016-03-12 02:47:56+01', 'n1', 'u3', 'e3')
      -- more (type cast only needed in 1st row)
      ) v(created, modified, name, url, email)
   )
, ins AS (
   INSERT INTO feeds_person (created, modified, name, url, email)
   SELECT created, modified, name, url, email FROM val
   ON     CONFLICT (name, url, email) DO NOTHING
   RETURNING id, name, url, email
   )
SELECT 'inserted' AS how, id FROM ins  -- inserted
UNION  ALL
SELECT 'selected' AS how, f.id         -- not inserted
FROM   val v
JOIN   feeds_person f USING (name, url, email);

The added complexity should pay for big tables where INSERT is the rule and SELECT the exception.

Originally, I had added a NOT EXISTS predicate on the last SELECT to prevent duplicates in the result. But that was redundant. All CTEs of a single query see the same snapshot of the tables: the INSERT's RETURNING set contains only newly inserted rows, while the final join sees the table as it was before the INSERT and thus returns only pre-existing rows. The two sets are mutually exclusive.

Unfortunately this also opens a tiny window for a race condition. If ...

  • a concurrent transaction inserts conflicting rows
  • has not committed yet
  • but commits eventually

... some rows may be lost.

To overcome this, you might just INSERT .. ON CONFLICT DO NOTHING, followed by a separate SELECT query for all rows - within the same transaction. Which in turn opens another tiny window for a race condition if concurrent transactions can commit writes to the table between INSERT and SELECT (in the default READ COMMITTED isolation level). That can be avoided with REPEATABLE READ transaction isolation (or stricter), or with a (possibly expensive or even unacceptable) write lock on the whole table. You can get any behavior you need, but there may be a price to pay.
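A minimal sketch of that two-step approach, with placeholder values (the key list in the SELECT must match the keys you just tried to insert):

BEGIN ISOLATION LEVEL REPEATABLE READ;

INSERT INTO feeds_person (created, modified, name, url, email)
SELECT DISTINCT ON (name, url, email) *
FROM  (
   VALUES
   (timestamptz '2016-03-12 02:47:56+01', timestamptz '2016-03-12 02:47:56+01', 'n3', 'u3', 'e3')
   -- ... more
   ) v(created, modified, name, url, email)
ON     CONFLICT (name, url, email) DO NOTHING;

SELECT id, name, url, email
FROM   feeds_person
WHERE  (name, url, email) IN (
   ('n3', 'u3', 'e3')
   -- ... same keys as above
   );

COMMIT;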


  • I responded to your comment on the question, but to elaborate: I don't really want to update anything; the only reason I was doing an UPDATE was because that's the only way I could figure out how to get the id back for the already existing rows. In other words, when I insert a large chunk of new data, I need the ids of the rows I inserted as well as the ids of the already existing rows... and I need them in the order I provide them in VALUES. AFAICT, that's the behavior PostgreSQL 9.5 exhibits... the docs aren't entirely clear to me whether that's guaranteed, though. – Dustin Wyatt Mar 12 '16 at 18:29
  • 1
    @DustinWyatt: Consider the update and note the added bits to defend against a race condition. – Erwin Brandstetter Mar 18 '16 at 16:54
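Regarding returning ids in the order of the VALUES list: the order of rows from RETURNING (and from the final SELECT) is not guaranteed. One way to get a defined order - a sketch, not part of the answer above, with an illustrative ord column added to number the input rows:

WITH val AS (
   SELECT DISTINCT ON (name, url, email) *
   FROM  (
      VALUES
      (1, timestamptz '2016-01-01 00:00+1', timestamptz '2016-01-01 00:00+1', 'n', 'u', 'e')
    , (2, '2016-03-12 02:47:56+01', '2016-03-12 02:47:56+01', 'n1', 'u3', 'e3')
      -- ... more
      ) v(ord, created, modified, name, url, email)
   ORDER  BY name, url, email, ord  -- keep the first occurrence per dupe set
   )
, ins AS (
   INSERT INTO feeds_person (created, modified, name, url, email)
   SELECT created, modified, name, url, email FROM val
   ON     CONFLICT (name, url, email) DO NOTHING
   RETURNING id, name, url, email
   )
SELECT v.ord, COALESCE(i.id, f.id) AS id  -- i.id for new rows, f.id for pre-existing ones
FROM   val v
LEFT   JOIN ins i USING (name, url, email)
LEFT   JOIN feeds_person f USING (name, url, email)
ORDER  BY v.ord;

The same race-condition caveats as above apply.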