I am implementing a simple web based RSS reader using python (not really relevant) and Postgresql (9.2 if relevant). The database schema is as follows (based on the RSS format):
CREATE TABLE feed_channel
(
id SERIAL PRIMARY KEY,
name TEXT,
link TEXT NOT NULL,
title TEXT
);
CREATE TABLE feed_content
(
id SERIAL PRIMARY KEY,
channel INTEGER REFERENCES feed_channel(id) ON DELETE CASCADE ON UPDATE CASCADE,
guid TEXT UNIQUE NOT NULL,
title TEXT,
link TEXT,
description TEXT,
pubdate TIMESTAMP
);
When I create a new channel (and also query for updated feed info) I request the feed, insert its data to the feed_channel table, selects the newly inserted ID - or existing to avoid duplicates - and then add the feed data to the feed_content table. A typical scenario would be:
- Query the feed url, grab feed headers and all current content
- Insert the feed headers into feed_channel if not exists... if already exists, grab the existing ID
- For each feed item, insert into the feed_content table with a reference to the stored channel ID
This is a standard "insert if not already exists, but return relevant ID" problem. To solve this I have implemented the following stored procedure:
CREATE OR REPLACE FUNCTION channel_insert(
p_link feed_channel.link%TYPE,
p_title feed_channel.title%TYPE
) RETURNS feed_channel.id%TYPE AS $$
DECLARE
v_id feed_channel.id%TYPE;
BEGIN
SELECT id
INTO v_id
FROM feed_channel
WHERE link=p_link AND title=p_title
LIMIT 1;
IF v_id IS NULL THEN
INSERT INTO feed_channel(name,link,title)
VALUES (DEFAULT,p_link,p_title)
RETURNING id INTO v_id;
END IF;
RETURN v_id;
END;
$$ LANGUAGE plpgsql;
This is then called as "select channel_insert(link, title);" from my application to insert if not already exists and then return the ID of the relevant row regardless of whether it was inserted or just found (step 2 in list above).
This works great!
However, I recently started wondering what would happen if this procedure was executed twice at the same time with the same arguments. Lets assume the following:
- User 1 attempts to add a new channel and thereby execute channel_insert
- A few ms later, User 2 attempts to add the same channel and also execute channel_insert
- User 1's check for existing rows completes, but before the insert is complete, User 2's check completes and says there are no existing rows.
Will this be a potential race condition in PostgreSQL? What is the best way to solve this problem to avoid such scenarios? Is it possible to make the entire stored procedure atomically, i.e. that it can only be executed once at the same time?
One option I tried was to make the fields Unique and then attempt to insert first, and if exception, select the existing instead... This worked, however, the SERIAL field would increment for each attempt, leaving a lot of gaps in the sequence. I don't know if that would be a problem in the long run (probably not), but kind of annoying. Perhaps this is the preferred solution?
Thanks for any feedback. This level of PostgreSQL magic is beyond me, so any feedback would be appreciated.