Trying to compare two queries to understand which one would be faster. The basic idea is to have one table (say test2
) that contains no duplicates, and to insert into it only the delta from a second table (say test1
) in such a way that if the source table contains duplicate records, only one copy of each is inserted.
Prep:
create table test1 (id varchar(10), a bigint, b bigint);
create table test2 (id varchar(10), a bigint, b bigint);
insert into test1 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test2 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);
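For reference, after this prep test1 holds seven rows (with 'bbb' and 'bb2' duplicated) while test2 holds the three 'aa*' rows, so a correct delta insert should add exactly one copy each of 'bbb' and 'bb2'. One way to eyeball the expected delta (assuming rows that share an id are identical, as they are here) is a set difference:

```sql
-- Expected delta: rows present in test1 but not yet in test2.
-- EXCEPT removes duplicates, so 'bbb' and 'bb2' each appear once.
SELECT id, a, b FROM test1
EXCEPT
SELECT id, a, b FROM test2;
```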
Query 1:
INSERT INTO test2
SELECT DISTINCT id,
a,
b
FROM test1
WHERE NOT EXISTS (SELECT *
FROM test2
WHERE test2.id = test1.id);
Query 2:
INSERT INTO test2
SELECT id,
a,
b
FROM (SELECT t2.*
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS dup_id,
*
FROM test1) t2
WHERE t2.dup_id = 1) t1
WHERE t1.id NOT IN (SELECT test2.id
FROM test2);
Can someone help me understand which one would be faster and more efficient?
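For what it's worth, with the prep data above both queries should leave test2 in the same state: five rows, one per id. A quick sanity check after running either query could be:

```sql
-- After running either query, test2 should contain five ids
-- (aaa, aa2, aa3, bbb, bb2), each with exactly one row.
SELECT id, COUNT(*) AS copies
FROM test2
GROUP BY id
ORDER BY id;
```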
Update
Explain of the first query:
db=# explain insert into test2 select distinct id, a, b from test1 where not exists (select * from test2 where test2.id=test1.id);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
XN Subquery Scan "*SELECT*" (cost=3613333.97..4213334.30 rows=7 width=49)
-> XN Unique (cost=3613333.97..4213334.23 rows=7 width=49)
-> XN Hash Left Join DS_BCAST_INNER (cost=3613333.97..4213334.18 rows=7 width=49)
Hash Cond: ("outer".oid = "inner".oid)
Filter: ("inner".oid IS NULL)
-> XN Seq Scan on test1 (cost=0.00..0.07 rows=7 width=53)
-> XN Hash (cost=3613333.96..3613333.96 rows=5 width=4)
-> XN Subquery Scan volt_dt_1 (cost=1760000.36..3613333.96 rows=5 width=4)
-> XN Unique (cost=1760000.36..3613333.91 rows=5 width=4)
-> XN Hash Join DS_DIST_BOTH (cost=1760000.36..3613333.90 rows=5 width=4)
Outer Dist Key: test1.id
Inner Dist Key: volt_dt_2.id
Hash Cond: (("outer".id)::text = ("inner".id)::text)
-> XN Seq Scan on test1 (cost=0.00..0.07 rows=7 width=37)
-> XN Hash (cost=1760000.34..1760000.34 rows=5 width=33)
-> XN Subquery Scan volt_dt_2 (cost=1760000.29..1760000.34 rows=5 width=33)
-> XN HashAggregate (cost=1760000.29..1760000.29 rows=5 width=33)
-> XN Hash Join DS_DIST_BOTH (cost=0.06..1760000.27 rows=5 width=33)
Outer Dist Key: test1.id
Inner Dist Key: test2.id
Hash Cond: (("outer".id)::text = ("inner".id)::text)
-> XN Seq Scan on test1 (cost=0.00..0.07 rows=7 width=33)
-> XN Hash (cost=0.05..0.05 rows=5 width=33)
-> XN Seq Scan on test2 (cost=0.00..0.05 rows=5 width=33)
----- Tables missing statistics: test2, test1 -----
----- Update statistics by running the ANALYZE command on these tables -----
(26 rows)
Explain of the second query:
db=# explain insert into test2 select id, a, b from (select t2.* from ( select row_number() over(partition by id order by id) as dup_id, * from test1 ) t2 where t2.dup_id = 1 ) t1 where t1.id not in (select test2.id from test2);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
XN Hash NOT IN Join DS_DIST_INNER (cost=1000000000000.23..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=49)
Inner Dist Key: db.test2.id
Hash Cond: (("outer".id)::text = ("inner".id)::text)
-> XN Subquery Scan t2 (cost=1000000000000.17..1000000000000.36 rows=1 width=49)
Filter: (dup_id = 1)
-> XN Window (cost=1000000000000.17..1000000000000.27 rows=7 width=49)
Partition: id
Order: id
-> XN Sort (cost=1000000000000.17..1000000000000.19 rows=7 width=49)
Sort Key: id
-> XN Network (cost=0.00..0.07 rows=7 width=49)
Distribute
-> XN Seq Scan on test1 (cost=0.00..0.07 rows=7 width=49)
-> XN Hash (cost=0.05..0.05 rows=5 width=33)
-> XN Seq Scan on test2 (cost=0.00..0.05 rows=5 width=33)
----- Tables missing statistics: test2, test1 -----
----- Update statistics by running the ANALYZE command on these tables -----
(17 rows)
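Note that both plans end with the same warning about missing statistics, and the cost estimates (in particular the enormous cost on the NOT IN join) are computed without them. As the planner itself suggests, it may be worth refreshing statistics and re-running EXPLAIN before trusting either plan:

```sql
-- Update table statistics so the planner produces realistic estimates,
-- then re-run EXPLAIN on both queries.
ANALYZE test1;
ANALYZE test2;
```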