0

I have a table with a random auto-generated id (primary key). I am trying to avoid the insertion of duplicate rows.

Example of a duplicate row:

id | field a | field b | field c
1  | 4       | 6       | 7
2  | 4       | 6       | 7

The key (id) is not a duplicate, since it is generated with a UUID, but all the other fields are identical.

I guess I'm looking for something like this, but for BigQuery: Avoiding inserting duplicate rows in mySQL

Ants Aus
  • BigQuery doesn't support primary keys. Are you asking for a column where the generated values are guaranteed to be unique? Or for something else? – Gordon Linoff Nov 07 '20 at 23:55

3 Answers

0

To help protect your table against duplication, set the insertId property when sending your streaming insert request.

BigQuery uses the insertId property to de-duplicate rows on a best-effort basis.

// Google.Cloud.BigQuery.V2: give each row an explicit insertId, which BigQuery
// uses for best-effort de-duplication of the streaming insert.
var client = BigQueryClient.Create("your-project-id");
client.InsertRows("your_dataset", "your_table",
    new BigQueryInsertRow(insertId: "row1") { /* ... field values ... */ },
    new BigQueryInsertRow(insertId: "row2") { /* ... field values ... */ }
);
A___
0

You can use not exists if you want to avoid inserting duplicate rows into the table:

insert into t (id, a, b, c)
    select i.*
    from (select 2 as id, 4 as a, 6 as b, 7 as c) i
    where not exists (select 1
                      from t
                      where t.a = i.a and t.b = i.b and t.c = i.c
                     );
Gordon Linoff
0

You can use an insert into ... where not exists ... query, and that is fine if you do it rarely. But it is something of an anti-pattern if you do it often.

This query needs to scan the table it inserts into, so it can get slow and expensive as the table grows. Partitioning and clustering might help, but if you insert many rows one at a time, it can still get costly.
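
For example, clustering the destination table on the non-key columns limits how much data such a not exists lookup has to scan. A minimal sketch, assuming hypothetical dataset and table names and the integer column types implied by the question:

-- Hypothetical DDL: clustering on (a, b, c) so duplicate checks only read
-- the storage blocks that can contain matching values.
create table mydataset.t (
  id string,
  a int64,
  b int64,
  c int64
)
cluster by a, b, c;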

A more common approach is to insert everything and periodically de-duplicate.
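
The periodic de-duplication can itself be a single query that rewrites the table, keeping one row per combination of the non-key columns. A minimal sketch with the same hypothetical names, keeping an arbitrary id from each group of duplicates:

-- Rewrite the table, keeping a single row for each distinct (a, b, c);
-- any_value picks one id out of each group of duplicates.
create or replace table mydataset.t as
select any_value(id) as id, a, b, c
from mydataset.t
group by a, b, c;

Note that create or replace table rewrites the whole table, so this is best run on a schedule; if the original table was partitioned or clustered, those options may need to be re-specified in the statement.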

Michael Entin