
I have some data which I want to add to an existing MySQL database. The new data may contain entries that are already saved in the DB. Since some of my columns are unique, I get, as expected, an ER_DUP_ENTRY error.

Bulk Insert

Let's say I want to use the following statement to save "A", "B" and "C" in a column named names of table mytable, and "A" is already saved there.

insert into mytable (names) values ("A"), ("B"), ("C");

Is there a way to directly use bulk insert to save "B" and "C" while ignoring "A"? Or do I have to build an insert statement for every new row? This leads to another question:

Normalize Data

Should I ensure no duplicate entries are uploaded before the actual insert statement? In my case I would need to select the existing data from the database, eliminate the duplicates, and then perform the insert shown above. Or is that a task which is supposed to be done by the database?

  • Most often, I believe this is handled by the database server. Take a look at this stack question: https://stackoverflow.com/q/3164505/3148109 – Zach King Aug 09 '18 at 18:21
  • If I get it right, I can't use bulk insert for that task? Because the way it is shown there, it checks for every entry if it already exists. – Neskews Aug 09 '18 at 18:26

2 Answers


If you have UNIQUE constraints that are blocking import, you have a few ways you can work around that:

INSERT IGNORE INTO mytable ...

If any individual rows violate a UNIQUE constraint, they are skipped. Other rows are inserted.
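Applied to the table from the question (assuming names carries the UNIQUE constraint), this might look like:

```sql
-- "A" already exists, so it is skipped with a warning;
-- "B" and "C" are inserted normally
INSERT IGNORE INTO mytable (names) VALUES ('A'), ('B'), ('C');

-- SHOW WARNINGS reports which rows were skipped as duplicates
SHOW WARNINGS;
```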

REPLACE INTO mytable ...

If any row violates a UNIQUE constraint, MySQL DELETEs the existing row, then INSERTs the new row. Keep in mind the side-effects of doing this: for example, foreign keys that cascade on delete and reference the deleted row, or the INSERT generating a new auto-increment id.
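A sketch of the side-effect, assuming mytable has an AUTO_INCREMENT primary key:

```sql
-- The existing row for "A" is deleted and re-inserted, so if mytable has
-- an AUTO_INCREMENT id, "A" gets a NEW id and the old one is gone.
-- Any rows in other tables referencing the old id via ON DELETE CASCADE
-- are deleted along with it.
REPLACE INTO mytable (names) VALUES ('A'), ('B'), ('C');
```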

INSERT INTO mytable ... ON DUPLICATE KEY UPDATE ...

More flexibility. This does not delete the original row, but allows you to set new values for any columns you choose, on a case-by-case basis. See also my answer to "INSERT IGNORE" vs "INSERT ... ON DUPLICATE KEY UPDATE"
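For illustration, assume a hypothetical extra column last_seen that should be refreshed whenever a duplicate name arrives:

```sql
-- last_seen is a hypothetical column added for this example.
-- New names are inserted; for existing names, only last_seen is updated
-- and the original row (and its id) is kept.
INSERT INTO mytable (names, last_seen)
VALUES ('A', NOW()), ('B', NOW()), ('C', NOW())
ON DUPLICATE KEY UPDATE last_seen = VALUES(last_seen);
```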

If you want to use bulk-loading with mysqlimport or the SQL statement equivalent LOAD DATA INFILE, there are options that match the INSERT IGNORE or REPLACE solutions, but not the INSERT...ON DUPLICATE KEY UPDATE solution.
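For example (the file path and CSV layout here are assumptions, not from the question):

```sql
-- IGNORE skips rows that would violate a unique key;
-- swap it for REPLACE to overwrite existing rows instead.
LOAD DATA INFILE '/tmp/names.csv'
IGNORE INTO TABLE mytable
FIELDS TERMINATED BY ','
(names);
```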

Read the docs for more information.

Bill Karwin

In some situations, I like to do this:

  1. LOAD DATA into a temp table
  2. Clean up the data
  3. Normalize as needed. (2 SQLs per column that needs normalizing -- details)
  4. Augment Summary table(s) (INSERT .. ON DUPLICATE KEY .. SELECT x, y, count(*), sum(z), .. GROUP BY x,y)
  5. Copy clean data from temp table to real table(s) ("Fact" table). (INSERT [IGNORE] .. SELECT [DISTINCT] .., or INSERT .. ON DUPLICATE KEY UPDATE (IODKU) with SELECT.)
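The steps above can be sketched as follows (staging-table and file names are illustrative, not from the answer):

```sql
-- 1. Load raw data into a staging table shaped like the target
CREATE TEMPORARY TABLE staging LIKE mytable;
LOAD DATA INFILE '/tmp/new_data.csv' INTO TABLE staging (names);

-- 2./5. De-dup within the batch (DISTINCT) and skip rows
--       already present in the real table (IGNORE)
INSERT IGNORE INTO mytable (names)
SELECT DISTINCT names FROM staging;

DROP TEMPORARY TABLE staging;
```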

More on Normalizing:

  • I do it outside any transactions. There are multiple reasons why this is better.
  • At worst (as a result of other failures), I occasionally throw an unused entry into the normalization table. No big deal.
  • No burning of AUTO_INCREMENT ids (except in edge cases).
  • Very fast.

Since REPLACE is a DELETE plus INSERT, it is almost guaranteed to be worse than IODKU. However, both burn ids when the rows already exist.

If at all possible, do not "loop" through the rows; instead find SQL statements to handle them all at once.

Depending on the details, de-dup in step 2 (if lots of dups) or in step 5 (dups are uncommon).

Rick James