Insert row if not exists leads to race condition?

Question

I am implementing a simple web based RSS reader using python (not really relevant) and Postgresql (9.2 if relevant). The database schema is as follows (based on the RSS format):

CREATE TABLE feed_channel
(
    id SERIAL PRIMARY KEY,
    name TEXT,
    link TEXT NOT NULL,
    title TEXT
);
CREATE TABLE feed_content
(
    id SERIAL PRIMARY KEY,
    channel INTEGER REFERENCES feed_channel(id) ON DELETE CASCADE ON UPDATE CASCADE,
    guid TEXT UNIQUE NOT NULL,
    title TEXT,
    link TEXT,
    description TEXT,
    pubdate TIMESTAMP
);

When I create a new channel (and also query for updated feed info) I request the feed, insert its data to the feed_channel table, selects the newly inserted ID - or existing to avoid duplicates - and then add the feed data to the feed_content table. A typical scenario would be:

Query the feed url, grab feed headers and all current content
Insert the feed headers into feed_channel if not exists... if already exists, grab the existing ID
For each feed item, insert into the feed_content table with a reference to the stored channel ID

This is a standard "insert if not already exists, but return relevant ID" problem. To solve this I have implemented the following stored procedure:

CREATE OR REPLACE FUNCTION channel_insert(
  p_link feed_channel.link%TYPE,
  p_title feed_channel.title%TYPE
) RETURNS feed_channel.id%TYPE AS $$
  DECLARE
    v_id feed_channel.id%TYPE;
  BEGIN
    SELECT id
    INTO v_id
    FROM feed_channel
    WHERE link=p_link AND title=p_title
    LIMIT 1;

    IF v_id IS NULL THEN
      INSERT INTO feed_channel(name,link,title)
      VALUES (DEFAULT,p_link,p_title)
      RETURNING id INTO v_id;
    END IF;

    RETURN v_id;

  END;
$$ LANGUAGE plpgsql;

This is then called as "select channel_insert(link, title);" from my application to insert if not already exists and then return the ID of the relevant row regardless of whether it was inserted or just found (step 2 in list above).

This works great!

However, I recently started wondering what would happen if this procedure was executed twice at the same time with the same arguments. Lets assume the following:

User 1 attempts to add a new channel and thereby execute channel_insert
A few ms later, User 2 attempts to add the same channel and also execute channel_insert
User 1's check for existing rows completes, but before the insert is complete, User 2's check completes and says there are no existing rows.

Will this be a potential race condition in PostgreSQL? What is the best way to solve this problem to avoid such scenarios? Is it possible to make the entire stored procedure atomically, i.e. that it can only be executed once at the same time?

One option I tried was to make the fields Unique and then attempt to insert first, and if exception, select the existing instead... This worked, however, the SERIAL field would increment for each attempt, leaving a lot of gaps in the sequence. I don't know if that would be a problem in the long run (probably not), but kind of annoying. Perhaps this is the preferred solution?

Thanks for any feedback. This level of PostgreSQL magic is beyond me, so any feedback would be appreciated.

No matter what you do, be careful to normalize your link format so that you don't have case problems (`Www.Example.Com` and `www.example.com`), parameter order issus (`?a=b&c=d` and `?c=d&a=b`), etc. — mu is too short, Dec 21 '12 at 17:44
A plpgsql function looping in case of a duplicate key violation can deal with the race condition on the server-side and at default isolation level, which is *safe* an typically *cheapest*: http://stackoverflow.com/questions/15939902/is-select-or-insert-in-a-function-prone-to-race-conditions/15950324#15950324 — Erwin Brandstetter, Jan 20 '15 at 11:50

score 5 · Answer 1 · answered Dec 26 '12 at 14:23

Will this be a potential race condition in PostgreSQL?

Yes, and in fact it would be in any database engine.

What is the best way to solve this problem to avoid such scenarios?

This is a loaded question and would require intimate knowledge of the usage of the database by multiple users. However, I'm going to give you some options. In short, the only option you have is to LOCK the table during this process, but how you lock that table will depend on how the database is used throughout the day.

Let's start with the basic LOCK:

LOCK TABLE feed_channel

That will lock the table using the ACCESS EXCLUSIVE lock option.

Conflicts with locks of all modes (ACCESS SHARE, ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE). This mode guarantees that the holder is the only transaction accessing the table in any way.

Now, that is the most restrictive lock available, and would certainly resolve the race condition, but may not be exactly what you want. That's something you're going to have to decide. So, though it's clear you're going to have to LOCK the table, it's not clear how.

What are you left to decide?

How do you want to LOCK the table? Study the locking options at that link to make your determination.
Where do you want to LOCK the table? Or in other words, do you want to LOCK at the top of the function (which I think you do based on the possible race condition), or do you simply want to LOCK right before the INSERT?

Is it possible to make the entire stored procedure atomically, i.e. that it can only be executed once at the same time?

No, the code can be executed by anybody that's connected to the database.

I hope that this has helped direct you.

wildplasser · Accepted Answer · 2012-12-26T17:18:52.597

There is an unavoidable "race" here, since two sessions cannot "see" eachothers uncommited rows. On a conflict, a session could only rollback (maybe to a savepoint) and retry. That would typically mean: referring to the other's freshly inserted row, instead of creating a private duplicate.

There is a data-modelling problem here: feed_channel appears to have a lot of candidate keys, and the cascading rule from feed_content could orphanise a lot of feed_content's rows (I suppose content-> channel is 1::M relation; more than one contents-row could refer to the same channel)

Finally, the feed_channel table at least needs the natural key {link,title}. That is where the insert/not exists is all about. (and the whole purpose of this function)

I cleaned up the function a bit. The IF construct is not needed, doing an INSERT WHERE NOT EXISTS first works just as well, and maybe even better.

DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;

CREATE TABLE feed_channel
    ( id SERIAL PRIMARY KEY
    , name TEXT
    , link TEXT NOT NULL
    , title TEXT NOT NULL -- part of PK :: must be not nullable
    , CONSTRAINT feed_channel_nat UNIQUE (link,title) -- the natural key
);

CREATE TABLE feed_content
    ( id SERIAL PRIMARY KEY
    , channel INTEGER REFERENCES feed_channel(id) ON DELETE CASCADE ON UPDATE CASCADE
    , guid TEXT UNIQUE NOT NULL -- yet another primary key
    , title TEXT --
    , link TEXT  -- title && link appear to be yet another candidate key
    , description TEXT
    , pubdate TIMESTAMP
    );

-- NOTE: omitted original function channel_insert() for brevity
CREATE OR REPLACE FUNCTION channel_insert_wp(
  p_link feed_channel.link%TYPE,
  p_title feed_channel.title%TYPE
) RETURNS feed_channel.id%TYPE AS $body$
   DECLARE
    v_id feed_channel.id%TYPE;
  BEGIN
      INSERT INTO feed_channel(link,title)
      SELECT p_link,p_title
      WHERE NOT EXISTS ( SELECT *
        FROM feed_channel nx
        WHERE nx.link= p_link
        AND nx.title= p_title
        )
        ;
    SELECT id INTO v_id
    FROM feed_channel ex
    WHERE ex.link= p_link
    AND ex.title= p_title
        ;

    RETURN v_id;

  END;
$body$ LANGUAGE plpgsql;

SELECT channel_insert('Bogus_link', 'Bogus_title');
SELECT channel_insert_wp('Bogus_link2', 'Bogus_title2');

SELECT * FROM feed_channel;

Results:

DROP SCHEMA
CREATE SCHEMA
SET
NOTICE:  CREATE TABLE will create implicit sequence "feed_channel_id_seq" for serial column "feed_channel.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "feed_channel_pkey" for table "feed_channel"
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "feed_channel_nat" for table "feed_channel"
CREATE TABLE
NOTICE:  CREATE TABLE will create implicit sequence "feed_content_id_seq" for serial column "feed_content.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "feed_content_pkey" for table "feed_content"
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "feed_content_guid_key" for table "feed_content"
CREATE TABLE
NOTICE:  type reference feed_channel.link%TYPE converted to text
NOTICE:  type reference feed_channel.title%TYPE converted to text
NOTICE:  type reference feed_channel.id%TYPE converted to integer
CREATE FUNCTION
NOTICE:  type reference feed_channel.link%TYPE converted to text
NOTICE:  type reference feed_channel.title%TYPE converted to text
NOTICE:  type reference feed_channel.id%TYPE converted to integer
CREATE FUNCTION
 channel_insert 
----------------
              1
(1 row)

 channel_insert_wp 
-------------------
                 2
(1 row)

 id | name |    link     |    title     
----+------+-------------+--------------
  1 |      | Bogus_link  | Bogus_title
  2 |      | Bogus_link2 | Bogus_title2
(2 rows)

If I drop the stored procedure altogether and instead just use: `INSERT INTO feed_channel (source) SELECT %(source)s WHERE NOT EXISTS ( SELECT 1 FROM feed_channel WHERE source=%(source)s );` will I ever potentially trigger a unique violation assuming that "source" have a unique constraint? — agnsaft, Dec 29 '12 at 20:17

Clodoaldo Neto · Answer 3 · 2012-12-21T14:19:06.387

3

Your foremost problem is that a serial does not make a good primary key for the feed_channel table. The primary key should be (link, title) or just (link) if title can be null. Then any try to insert an existent feed would raise a primary key error.

BTW v_id will be null whenever title is null:

WHERE link=p_link AND title=p_title

edited Dec 21 '12 at 14:19

answered Dec 21 '12 at 13:17

Clodoaldo Neto

98,807
21
191
235

That is an interesting idea, but won't it be a pain to make a reference/foreign key if the primary key consists of multiple fields? Also, I first tried making a unique constraint to achieve the same result, however, since I need to return the correct ID also on insert failure due to the unique constraint I abandoned that approach. Furthermore, I found it to be a hassle to deal with commits when there is suddenly a failure due to contraints. – agnsaft Dec 21 '12 at 14:07
1

@invictus Why would it be a pain to make a reference/foreign key if the primary key consisted of multiple fields? There would be no `ID`. You would not have to return anything since you already knew what natural key (link, title) you tried to insert. – Clodoaldo Neto Dec 21 '12 at 14:17

Insert row if not exists leads to race condition?

3 Answers3