
I'm trying to compare two queries to understand which one would be faster. The basic idea: one table (here test2) must stay free of duplicates, and you insert into it only the delta from a source table (test1), in such a way that if the source table contains duplicate records, only one copy of each is inserted.

Prep:

create table test1  (id varchar(10), a bigint, b bigint);
create table test2  (id varchar(10), a bigint, b bigint);
insert into test1 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test2 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);

Query 1:

INSERT INTO test2
SELECT DISTINCT id,
                a,
                b
FROM   test1
WHERE  NOT EXISTS (SELECT *
                   FROM   test2
                   WHERE  test2.id = test1.id);

Query 2:

INSERT INTO test2
    SELECT id,
           a,
           b
    FROM   (SELECT t2.*
            FROM   (SELECT Row_number() OVER(partition BY id ORDER BY id) AS dup_id,
                           *
                    FROM   test1) t2
            WHERE  t2.dup_id = 1) t1
    WHERE  t1.id NOT IN (SELECT test2.id
                         FROM   test2);

Can someone help me understand which one would be faster and more efficient?

Update

Explain of the first query:

db=# explain insert into test2 select distinct id, a, b from test1 where not exists (select * from test2 where test2.id=test1.id);
                                                            QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 XN Subquery Scan "*SELECT*"  (cost=3613333.97..4213334.30 rows=7 width=49)
   ->  XN Unique  (cost=3613333.97..4213334.23 rows=7 width=49)
         ->  XN Hash Left Join DS_BCAST_INNER  (cost=3613333.97..4213334.18 rows=7 width=49)
               Hash Cond: ("outer".oid = "inner".oid)
               Filter: ("inner".oid IS NULL)
               ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=53)
               ->  XN Hash  (cost=3613333.96..3613333.96 rows=5 width=4)
                     ->  XN Subquery Scan volt_dt_1  (cost=1760000.36..3613333.96 rows=5 width=4)
                           ->  XN Unique  (cost=1760000.36..3613333.91 rows=5 width=4)
                                 ->  XN Hash Join DS_DIST_BOTH  (cost=1760000.36..3613333.90 rows=5 width=4)
                                       Outer Dist Key: test1.id
                                       Inner Dist Key: volt_dt_2.id
                                       Hash Cond: (("outer".id)::text = ("inner".id)::text)
                                       ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=37)
                                       ->  XN Hash  (cost=1760000.34..1760000.34 rows=5 width=33)
                                             ->  XN Subquery Scan volt_dt_2  (cost=1760000.29..1760000.34 rows=5 width=33)
                                                   ->  XN HashAggregate  (cost=1760000.29..1760000.29 rows=5 width=33)
                                                         ->  XN Hash Join DS_DIST_BOTH  (cost=0.06..1760000.27 rows=5 width=33)
                                                               Outer Dist Key: test1.id
                                                               Inner Dist Key: test2.id
                                                               Hash Cond: (("outer".id)::text = ("inner".id)::text)
                                                               ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=33)
                                                               ->  XN Hash  (cost=0.05..0.05 rows=5 width=33)
                                                                     ->  XN Seq Scan on test2  (cost=0.00..0.05 rows=5 width=33)
 ----- Tables missing statistics: test2, test1 -----
 ----- Update statistics by running the ANALYZE command on these tables -----
(26 rows)

Explain of the second query:
db=# explain insert into test2 select id, a, b from (select t2.* from ( select row_number() over(partition by id order by id) as dup_id, * from test1 ) t2 where t2.dup_id = 1 ) t1 where t1.id not in (select test2.id from test2); 
                                                                                     QUERY PLAN                                                                                     
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Hash NOT IN Join DS_DIST_INNER  (cost=1000000000000.23..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=49)
   Inner Dist Key: db.test2.id
   Hash Cond: (("outer".id)::text = ("inner".id)::text)
   ->  XN Subquery Scan t2  (cost=1000000000000.17..1000000000000.36 rows=1 width=49)
         Filter: (dup_id = 1)
         ->  XN Window  (cost=1000000000000.17..1000000000000.27 rows=7 width=49)
               Partition: id
               Order: id
               ->  XN Sort  (cost=1000000000000.17..1000000000000.19 rows=7 width=49)
                     Sort Key: id
                     ->  XN Network  (cost=0.00..0.07 rows=7 width=49)
                           Distribute
                           ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=49)
   ->  XN Hash  (cost=0.05..0.05 rows=5 width=33)
         ->  XN Seq Scan on test2  (cost=0.00..0.05 rows=5 width=33)
 ----- Tables missing statistics: test2, test1 -----
 ----- Update statistics by running the ANALYZE command on these tables -----
(17 rows)
  • try explain or explain analyze – Soni Harriz Jul 08 '16 at 20:46
  • **First**. Why don't you try it and check the explain plan? Performance questions should include `EXPLAIN ANALYZE` and some information about table size, indexes, current performance, desired performance, etc. `Slow` is a relative term and we need a real value to compare. **Second**, Postgres 9.5+ already has one solution implemented, check [**HERE**](http://stackoverflow.com/questions/1009584/how-to-emulate-insert-ignore-and-on-duplicate-key-update-sql-merge-with-po/34639631#34639631) – Juan Carlos Oropeza Jul 08 '16 at 20:47
  • @JuanCarlosOropeza thanks a lot for the link, I didn't know about that. But unfortunately I'm running it on a database which is a fork of 8.x Postgres .... – Vor Jul 08 '16 at 20:53
  • @Vor Yes, I have the same kind of problem right now and also couldn't use it because I'm on 9.3... but if you check the links related to that question you will see a lot of solutions. Warning: it's a long path ;) – Juan Carlos Oropeza Jul 08 '16 at 20:56

2 Answers


I think the first one should be faster, although it wants an index on test2(id).

Normally, the answer to such questions is "try it on your data and your system . . . and let us know". However, the row_number() requires a full scan of test1. You might as well do the index lookup in test2 at the same time -- which is what the first version does.


An INSERT is generally more expensive than a SELECT, UPDATE or DELETE, since it cannot be narrowed by a WHERE clause, and it gets more expensive the more indexes the table has.

Still, both queries here do an INSERT, so that point does not differentiate them.

  1. In your first query, DISTINCT definitely needs a table scan, and in the WHERE clause, instead of NOT EXISTS, you can check the id with the NOT IN operator:

INSERT INTO test2
SELECT DISTINCT id,
                a,
                b
FROM   test1
WHERE  id NOT IN (SELECT id FROM test2);

As everyone knows, SELECT * FROM a table is expensive; it is better to select only the required attributes.

  2. In your second query, as Gordon said, ROW_NUMBER needs a table scan too, and you are using lots of derived tables in that query, so I guess the first query is better optimized than the second.
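One caveat worth adding on the NOT IN rewrite (not in the original answer): id is a nullable varchar here, and a single NULL in test2.id makes `x NOT IN (...)` evaluate to unknown for every x, so nothing gets inserted, while NOT EXISTS keeps working. A small sketch, using SQLite to illustrate the three-valued logic (the semantics are the same in Postgres/Redshift):

```python
# With one NULL id in test2, NOT IN rejects every row of test1,
# while NOT EXISTS still finds the genuinely missing row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table test1 (id varchar(10), a bigint, b bigint);
    create table test2 (id varchar(10), a bigint, b bigint);
    insert into test1 values ('bbb', 2, 1);
    insert into test2 values ('aaa', 1, 1), (NULL, 9, 9);  -- note the NULL id
""")

not_in = conn.execute(
    "select count(*) from test1 "
    "where id not in (select id from test2)").fetchone()[0]
not_exists = conn.execute(
    "select count(*) from test1 "
    "where not exists (select * from test2 where test2.id = test1.id)"
).fetchone()[0]

print(not_in, not_exists)  # 0 1
```

So the NOT IN form is only safe if test2.id is guaranteed non-NULL.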