0

I have a table with a random auto-generated id (primary key). I am trying to avoid the insertion of duplicate rows.

Example of a duplicate row:

id | field a | field b | field c
1  | 4       | 6       | 7
2  | 4       | 6       | 7

The key (id) is not a duplicate, since it is generated with a UUID, but all the other fields are identical.

I guess I'm looking for something like this, but for BigQuery: Avoiding inserting duplicate rows in mySQL

Ants Aus
  • BigQuery doesn't support primary keys. Are you asking for a column where the generated values are guaranteed to be unique? Or for something else? – Gordon Linoff Nov 07 '20 at 23:55

3 Answers

0

To help protect your table against duplication, set the insertId property when sending your streaming insert request.

BigQuery uses the insertId property to de-duplicate rows on a best-effort basis.

// Google.Cloud.BigQuery.V2: give each row an explicit insertId, which BigQuery
// uses for best-effort de-duplication of the streaming insert.
var client = BigQueryClient.Create("your-project-id");
client.InsertRows("your_dataset", "your_table",
    new BigQueryInsertRow(insertId: "row1") { /* ... field values ... */ },
    new BigQueryInsertRow(insertId: "row2") { /* ... field values ... */ }
);
A___
0

You can use not exists if you want to avoid inserting duplicate rows into the table:

insert into t (id, a, b, c)
    select i.*
    from (select 2 as id, 4 as a, 6 as b, 7 as c) i
    where not exists (select 1
                      from t
                      where t.a = i.a and t.b = i.b and t.c = i.c
                     );
Gordon Linoff
0

You can use an insert into ... where not exists ... query, and that is fine if you do it rarely. But it is something of an anti-pattern if you do it often.

This query needs to scan the table it inserts into, so it can get slow and expensive as the table grows. Partitioning and clustering might help, but if you insert many rows one at a time, it can still get costly.
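
For example, clustering the destination table on the non-key columns limits how much data such a not exists lookup has to scan. A minimal sketch, assuming hypothetical dataset and table names and the integer column types implied by the question:

-- Hypothetical DDL: clustering on (a, b, c) so duplicate checks only read
-- the storage blocks that can contain matching values.
create table mydataset.t (
  id string,
  a int64,
  b int64,
  c int64
)
cluster by a, b, c;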

A more common approach is to insert everything and periodically de-duplicate.
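
The periodic de-duplication can itself be a single query that rewrites the table, keeping one row per combination of the non-key columns. A minimal sketch with the same hypothetical names, keeping an arbitrary id from each group of duplicates:

-- Rewrite the table, keeping a single row for each distinct (a, b, c);
-- any_value picks one id out of each group of duplicates.
create or replace table mydataset.t as
select any_value(id) as id, a, b, c
from mydataset.t
group by a, b, c;

Note that create or replace table rewrites the whole table, so this is best run on a schedule; if the original table was partitioned or clustered, those options may need to be re-specified in the statement.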

Michael Entin