
This is more of a conceptual question because I'm planning how best to achieve our goals here.

I have a postgresql/postgis table with 5 columns. I'll be inserting/appending data into the database from a csv file every 10 minutes or so via the copy command. There will likely be some duplicate rows of data, so I'd like to copy the data from the csv file into the postgresql table while preventing any duplicate entries from getting in. There are three columns ("latitude", "longitude" and "time") where, if all three are equal, the entry is a duplicate. Should I make a composite key from all three columns? If I do that, will it just throw an error upon trying to copy the csv file into the database? I'm going to be copying the csv file automatically, so I'd want it to go ahead and copy the rows that aren't duplicates and skip the ones that are. Is there a way to do this?

Also, I of course want it to look for duplicates in the most efficient way. I don't need to look through the whole table (which will be quite large) for duplicates, just the past 20 minutes or so based on the timestamp on the row. I've already indexed the table on the time column.
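For concreteness, here is a minimal sketch of what such a table could look like. Every name below is an assumption (the question doesn't show the schema), and the two columns beyond latitude/longitude/time are placeholders:

-- requires PostGIS if you keep the geometry column: create extension postgis;
create table observations (
    latitude   double precision not null,
    longitude  double precision not null,
    "time"     timestamptz      not null,
    speed      double precision,               -- placeholder for one of the remaining columns
    geom       geometry(Point, 4326),          -- assumed PostGIS point built from lat/lon
    unique (latitude, longitude, "time")       -- the three-column "no duplicates" rule
);

-- index on the timestamp, as described above
create index observations_time_idx on observations ("time");

Note that the unique constraint alone would make a plain COPY abort on the first duplicate it hits, which is exactly the problem the answers below work around.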

Thanks for any help!

user1610717

3 Answers


I think I would take the following approach.

First, create an index on the three columns that you care about:

create unique index idx_bigtable_col1_col2_col3 on bigtable(col1, col2, col3);

Then, load the data into a staging table using copy. Finally, you can do:

insert into bigtable(col1, . . . )
    select col1, . . .
    from stagingtable st
    where (col1, col2, col3) not in (select col1, col2, col3 from bigtable);

Assuming no other data modifications are going on, this should accomplish what you want. Checking for duplicates using the index should be ok from a performance perspective.
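The loading step is described only in prose above; here is a minimal sketch of that round trip, assuming stagingtable mirrors bigtable's columns and using a placeholder CSV path:

-- one-time setup: a staging table with the same columns as bigtable, no constraints
create table if not exists stagingtable (like bigtable including defaults);

-- each batch: empty the staging area, bulk-load the new csv, then run the insert above
truncate stagingtable;
copy stagingtable from '/path/to/batch.csv' with (format csv, header true);
-- (use psql's \copy instead if the file lives on the client rather than the server)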

An alternative method is to emulate MySQL's "on duplicate key update" behavior and ignore such records. Bill Karwin suggests implementing a rule in an answer to this question. The documentation for rules is here. Something similar could also be done with triggers; see the sketch below.
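As one illustration of the trigger route (a sketch with assumed object names, not code from the linked answer), a BEFORE INSERT row-level trigger can silently drop any row whose (col1, col2, col3) combination already exists, which would let COPY target bigtable directly:

create or replace function skip_duplicate_rows() returns trigger as $$
begin
    -- if a row with the same three key columns already exists, return NULL
    -- so this row is silently skipped instead of raising a unique violation
    if exists (
        select 1 from bigtable
        where col1 = NEW.col1 and col2 = NEW.col2 and col3 = NEW.col3
    ) then
        return NULL;
    end if;
    return NEW;
end;
$$ language plpgsql;

create trigger bigtable_skip_duplicates
    before insert on bigtable
    for each row execute procedure skip_duplicate_rows();

Since a row-level trigger fires once per inserted row, the set-based staging-table insert above will usually be cheaper for large batches; the trigger is mainly convenient if you want to COPY straight into the target table.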

Gordon Linoff
  • Implementing it via rules does work, but it is a bad idea to suggest it to beginners, IMHO. And wrt performance, it will buy you nothing, since under the hood it will effectively execute the same SQL steps as the hardcoded version. – wildplasser Jul 26 '15 at 17:13
  • @wildplasser . . . It is hard to pass up a recommendation by Bill Karwin. However, I do agree with you. – Gordon Linoff Jul 27 '15 at 02:43
  • I really appreciate this response. I'm looking forward to testing it out in the near-term. I'll respond with how it goes soon. – user1610717 Jul 27 '15 at 17:20
  • @user1610717 . . . You may want to ask another question. – Gordon Linoff Aug 12 '15 at 11:06
  • Sorry didn't realize that comment got sent. Works perfectly...thanks!! – user1610717 Aug 13 '15 at 11:50
  • Good Answer. Seems like the new `UPSERT` feature applies to this code example. So I posted [an additional Answer](https://stackoverflow.com/a/44447572/642706) showing that alternative syntax. Same result, but a bit simpler. – Basil Bourque Jun 09 '17 at 00:45

Upsert

The Answer by Linoff is correct but can be simplified a bit by the new "UPSERT" feature (a.k.a. MERGE) in Postgres 9.5. That new feature is implemented in Postgres as the INSERT … ON CONFLICT syntax.

Rather than explicitly checking for a violation of the unique index, we let the ON CONFLICT clause detect the violation. Then we DO NOTHING, meaning we abandon the effort to INSERT without bothering to attempt an UPDATE. If we cannot insert, we just move on to the next row.

We get the same results as Linoff’s code but lose the WHERE clause.

INSERT INTO bigtable (col1, … )
    SELECT col1, …
    FROM stagingtable st
ON CONFLICT (col1, col2, col3)
DO NOTHING
;
Basil Bourque

Basil's method is great, but note the syntax of the conflict target: it must be a parenthesized list of the columns covered by the unique index (or ON CONSTRAINT followed by a constraint name), not a bare index name. Referring to the documentation at https://www.postgresql.org/docs/9.5/sql-insert.html I could get it to work:

INSERT INTO bigtable (col1, … )
    SELECT col1, …
    FROM stagingtable st
ON CONFLICT (col1, col2, col3)
DO NOTHING
;
Adel Khayata
Tharaka