1

I am importing some data from a csv file into MySQL and trying to ignore duplicate rows.

mysql_query("INSERT IGNORE INTO products (parent_product_url, child_product_url, swatch) VALUES ('".$row[0]."', '".$row[1]."', '".$row[2]."')");

My csv file.

polo.htm,red.htm,red.jpg
polo.htm,green.htm,green.jpg
round-neck.htm,green.htm,green.jpg

Now if I run below csv file it should ignore first three rows as they already exists in the table. It should insert only fourth row.

polo.htm,red.htm,red.jpg
polo.htm,green.htm,green.jpg
round-neck.htm,green.htm,green.jpg
v-neck.htm,red.htm,red.jpg
AZee
  • 492
  • 1
  • 5
  • 21
  • What is the primary key on products and do you have any unique key also? – zedfoxus Oct 14 '15 at 22:43
  • ID is the primary key and it has auto increment, I do not have unique key but I think we can take "child_product_url" as unique. – AZee Oct 14 '15 at 22:45
  • Okay, in that case, MySQL is behaving correctly. `insert ignore` will throw warning and not insert a record if that record violates primary or unique key. Since combination of those 3 fields does not violate any constraints, MySQL happily inserts those data over and over again. To prevent that from happening, you could create unique index on parent_product_url (if it makes sense to do so). – zedfoxus Oct 14 '15 at 22:50
  • I think I got you. So are you suggesting to alter database structure? – AZee Oct 14 '15 at 22:55
  • Oh no. Your database structure is fine. Just like Gordon mentioned, to prevent duplicate values for combination of the 3 fields, you would have to add a unique composite index using the 3 columns. In my answer I have added some examples for reference. – zedfoxus Oct 14 '15 at 23:00
  • I changed child_product_url to Unique and it works but when I insert a new row with different parent_product_url and add existing child_product_url it does not update. – AZee Oct 14 '15 at 23:01
  • Your goal is to ensure that parent_product_url + child_product_url + swatch combination must be unique, correct? – zedfoxus Oct 14 '15 at 23:03
  • Yes that is correct. – AZee Oct 14 '15 at 23:04
  • Great. In that case, create a unique index on those 3 columns. `create unique index uk_products_parenturl_childurl_swatch on products(parent_product_url, child_product_url, swatch);`. Before you make the index, make sure you don't have duplicate data in the table already. – zedfoxus Oct 14 '15 at 23:05
  • Actually I do not have the flexibility to add a fourth column. Can I still achieve it? – AZee Oct 14 '15 at 23:07
  • I don't understand. Could you please re-phrase the question? You want to add a 4th column to the table or to the index or to the CSV? – zedfoxus Oct 14 '15 at 23:08
  • Yes. The create unique index suggested in both answered and previous comment will allow you to do what you intend to do – zedfoxus Oct 14 '15 at 23:43
  • I assume you meant a new column in the database as unique indexes correct? – AZee Oct 14 '15 at 23:47
  • I meant doing this: `create unique index uk_products_parenturl_childurl_swatch on products(parent_product_url, child_product_url, swatch);` Did you execute this statement and try the inserts again? – zedfoxus Oct 15 '15 at 00:22
  • I updated my code to http://pastebin.com/NkL9ijGM and nothing happens. looks like it just executes the second query normally. – AZee Oct 15 '15 at 00:34
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/92332/discussion-between-zedfoxus-and-azee). – zedfoxus Oct 15 '15 at 02:08

3 Answers3

1

I prefer on duplicate key update because insert ignore ignores all errors, not just duplication errors.

Regardless of which you use, your problem is probably the lack of unique constraint/index.

You don't specify what you mean by "duplicate". Assuming you mean all the columns:

create unique index unq_products_3 on products(parent_product_url, child_product_url, swatch);

Note: there is a maximum length to the keys used for indexes, depending on the storage engine. If your columns are too long, you may need to think about other approaches.

Gordon Linoff
  • 1,122,135
  • 50
  • 484
  • 624
0

Records are inserted again when you re-execute insert statements because the inserts are not violating any unique or primary key index. Therefore MySQL doesn't have anything to ignore.

create table products (
  parent_product_url varchar(100), 
  child_product_url varchar(100), 
  swatch varchar(100)
);

-- this will enter both records
insert ignore into products values ('polo.htm', 'red.htm', 'red.jpg');
insert ignore into products values ('polo.htm', 'green.htm', 'green.jpg');

-- this will enter both records **AGAIN**
insert ignore into products values ('polo.htm', 'red.htm', 'red.jpg');
insert ignore into products values ('polo.htm', 'green.htm', 'green.jpg');

Now let's add uniqueness to parent_product_url and try again:

truncate table products;
create unique index uk_products_parent_product_url on products(parent_product_url);
insert ignore into products values ('polo.htm', 'red.htm', 'red.jpg');
insert ignore into products values ('polo.htm', 'green.htm', 'green.jpg');

This will enter only the first record. 2nd record will be ignored and a warning will be thrown. No error will be thrown.

If you desire to have a combination of the 3 columns to be unique, then you would do this (This is what Gordon Linoff has mentioned also...I am just adding more context):

alter table products drop key uk_products_parent_product_url;
create unique index uk_products_parenturl_childurl_swatch on 
  products(parent_product_url, child_product_url, swatch);
insert ignore into products values ('polo.htm', 'red.htm', 'red.jpg');
insert ignore into products values ('polo.htm', 'green.htm', 'green.jpg');

Now you will see only two records inserted even when you re-execute the same 2 insert statements many times.

From https://dev.mysql.com/doc/refman/5.5/en/insert.html

If you use the IGNORE keyword, errors that occur while executing the INSERT statement are ignored. For example, without IGNORE, a row that duplicates an existing UNIQUE index or PRIMARY KEY value in the table causes a duplicate-key error and the statement is aborted. With IGNORE, the row is discarded and no error occurs. Ignored errors may generate warnings instead, although duplicate-key errors do not.

zedfoxus
  • 28,612
  • 4
  • 47
  • 53
0

I got it solved with the help of this Answer -> Insert query check if record exists - If not, Insert it

Below is my updated query

mysql_query("INSERT INTO products (parent_product_url, child_product_url, swatch)
             SELECT * FROM (SELECT '".$row[0]."', '".$row[1]."', '".$row[2]."') AS tmp
             WHERE NOT EXISTS (
             SELECT * FROM products WHERE parent_product_url='".$row[0]."' AND child_product_url='".$row[1]."' AND swatch='".$row[2]."'
             );");
Community
  • 1
  • 1
AZee
  • 492
  • 1
  • 5
  • 21