1

I have a table like

create table dados(
    id INT AUTO_INCREMENT,
    id2 VARCHAR(18) NOT NULL,
    id3 VARCHAR(18) NOT NULL,
    ...
    PRIMARY KEY (id));

I need to allow rows that have the same combination of (id2, id3), but I want to do a load data local infile into this table skipping matches of those combinations.

I've tried doing load file into a temporary table

create temporary table temp_tbl like dados;

and from there I tried these (syntax not necessarily correct here):

insert into dados select * from temp_tbl where not exists 
( select 1 from dados where dados.id2=temp_tbl.id2 and dados.id3=temp_tbl.id3);

and

insert into dados select * from temp_table left join dados
on dados.id2=temp_table.id2 and dados.id3=temp_table.id3
where dados.id2 is null and dados.id3 is null;

But both querys take forever as the table has a few million entries.

As an alternative to these, I thought about creating another column that would have either the value of id3 or, in case of repetition of the combination (id2,id3), have it be id3_'some other incremental number here', so I would have

create table dados(
  id INT AUTO_INCREMENT,
  id2 VARCHAR(18) NOT NULL,
  id3 VARCHAR(18) NOT NULL,
  ...
  other_identification varchar(25) NOT NULL,
  PRIMARY KEY (id,other_identification));

But I would like to avoid this solution as it creates a probably unnecessary column into an already really big table. Another thing I thought about doing was having the primary key be (id,id2,id3) and on the insert have maybe an IGNORE or an ON DUPLICATE KEY, but using only id2 and id3 as keys, however I could not find anything similar to this.

Any help would be apreciated.

ghokane
  • 23
  • 1
  • 6
  • Define a unique key for the temp table and use [this answer](https://stackoverflow.com/a/13425727/5563083). – Paul Spiegel Aug 15 '19 at 16:08
  • @PaulSpiegel but then it would only prevent the data from the load file to be duplicate, if the same combination from the load file is already in the table dados, it would still be inserted, right? Or am I missing something? – ghokane Aug 15 '19 at 16:15
  • Right.. I missed that point. Do you have at least an index on `(id2, id3)`? You need an index to support a duplicate check. And how many rows are in the CSV file? – Paul Spiegel Aug 15 '19 at 16:18
  • I don't, maybe with the index the methods I tried would be faster? I will give it a check, thanks. – ghokane Aug 15 '19 at 16:20
  • Regarding the last point. That wouldn't help. A primary key on (id,id2,id3) will not stop you from inserting duplicates in (id2, id3) combination, since `id` is already unique "on it's own". – Paul Spiegel Aug 15 '19 at 18:49

1 Answers1

1

Which ever method you try, it will be slow without an index on (id2, id3).

I would do the following:

Define the index in the original table:

alter table dados add index (id2, id3);

Create the temporary table with a unique index on (id2, id3). It can actually be the primary key:

create temporary table temp_tbl(
  id2 VARCHAR(18) NOT NULL,
  id3 VARCHAR(18) NOT NULL,
  ...
  PRIMARY KEY (id2, id3)

Use LOAD DATA INFILE with IGNORE, to ignore duplicates in the CSV files:

LOAD DATA INFILE '/path/to/file.csv'
IGNORE INTO TABLE temp_tbl
...

Remove existing (id2, id3) combinations with a DELETE .. JOIN query:

delete t
from temp_tbl t
join dados d on d.id2 = t.id2 and d.id3 = t.id3;

Insert data from the temporary table into the original one:

insert into dados (id2, id3, ...)
  select id2, id3, ...
  from temp_tbl;

See demo on db-fiddle.com.

Paul Spiegel
  • 27,993
  • 5
  • 38
  • 48
  • I actually forgot to mention that the entries on the load file did not have duplicates, and had to do an update if found the match. But creating the indexes I was able to first do the update from the temporary table into dados using UPDATE .. JOIN and then use the INSERT .. JOIN I mentioned on the question. Marking as solved because using the index was the actual solution, thanks! – ghokane Aug 15 '19 at 18:33
  • You also forgot to mention anything about an UPDATE :-) But nice to see that people are actually able to extract what they need from an answer. – Paul Spiegel Aug 15 '19 at 18:46