Find duplicate for a combination of rows - Oracle SQL

Question

I have a table:

table1

unique_id       user_id   user_seq      col_name            value_val    position
1               100       1             test1               100          1
1               100       1             test2               123          1
1               100       1             test1               a            2
1               100       1             test2               text         2
1               100       1             test3               1Rw          2
1               100       1             test4               1Tes         2

2               101       1             test1               1            1
2               101       1             test2               1            1
2               101       1             test3               1            1
2               101       1             test4               1            1
2               101       1             test5               1            1

3               100       1             test1               100          1
3               100       1             test2               123          1
3               100       1             test1               a            2
3               100       1             test2               text         2
3               100       1             test3               1Rw          2
3               100       1             test4               1Tes         2

4               101       1             test1               1            1
4               101       1             test2               1            1
4               101       1             test3               1            1
4               101       1             test4               1            1

I need to find duplicate based on the following:

user_id, user_seq, col_name, value_val and position should be exactly same for different unique_id.

In the above example, unique_id - 1 and 3 are EXACTLY same, so they should be returned as output.

For unique_id = 2 and 4, there is a difference of test5 not available for unique_id = 4, so it won't be captured.

Output would be:

unique_id
1
3

Also, my dataset is huge, around 50 million records, so need an optimized solution. Any help?

EDIT

My table structure:

Name        Null? Type           
----------- ----- --------------         
UNIQUE_ID          NUMBER         
USER_SEQ           VARCHAR2(100)          
COL_NAME           VARCHAR2(263)  
VALUE_VAL          VARCHAR2(4000) 
POSITION           NUMBER             
USER_ID            NUMBER

No indexes available.

I also thought of the same, however that will not work for my case. For ex. user_id = 101 and user_seq = 1 will give me count of 2 for col_name = test1 whereas it should check for a combination of rows. — dang, May 28 '19 at 08:33
*"Also, my dataset is huge, around 50 million records, so need an optimized solution. Any help?"* Also share the `CREATE TABLE` structures so we know the datatypes and indexes.. Also 50 million records isn't "huge" i had to say it.. — Raymond Nijland, May 28 '19 at 09:30
@RaymondNijland - I have shared output of desc table1; in the question. Right now, there are no indexes. — dang, May 28 '19 at 09:54
i need to eat now but i would do something like using `STANDARD_HASH()` in combination with LISTAGG to generate a checksum over a group off records see [example](https://dbfiddle.uk/?rdbms=oracle_18&fiddle=6f15a2785adfad8fa617ab64feda8f27) load it into a temporary table (with with indexes) and you can simply use `GROUP BY, HAVING, COUNT(DISTINCT ) ` syntax on the temporary table.. — Raymond Nijland, May 28 '19 at 10:56
@RaymondNijland - it is giving me error - ORA-01489: result of string concatenation is too long — dang, May 28 '19 at 11:02
*" it is giving me error - ORA-01489: result of string concatenation is too long "* i didn't do Oracle database for a while now so iam rusty but it would make sense on real dataset and it is a `VARCHAR2 ` and if you use `STANDARD_HASH(LISTAGG(TO_CLOB(hash))) ` instead? — Raymond Nijland, May 28 '19 at 11:22
It works for the first time, but if I scroll down, it gives me error - ORA-01489: result of string concatenation is too long — dang, May 28 '19 at 11:24
ok try `ALTER SYSTEM SET max_string_size=extended SCOPE= SPFILE;` `LISTAGG` seams to be limited by `max_string_size` setting... If that does not work count me out as iam then out off options as i didn't do Oracle database for a while now and iam rusty — Raymond Nijland, May 28 '19 at 11:28

Boneist · Accepted Answer · 2019-05-28T10:25:29.743

Here's one way of doing it:

with sample_data as (select 1 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, '100' value_val, 1 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, '123' value_val, 1 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, 'a' value_val, 2 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, 'text' value_val, 2 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test3' col_name, '1Rw' value_val, 2 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test4' col_name, '1Tes' value_val, 2 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test5' col_name, '1' value_val, 1 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, '100' value_val, 1 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, '123' value_val, 1 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, 'a' value_val, 2 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, 'text' value_val, 2 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test3' col_name, '1Rw' value_val, 2 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test4' col_name, '1Tes' value_val, 2 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, '100' value_val, 1 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, '123' value_val, 1 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, 'a' value_val, 2 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, 'text' value_val, 2 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test3' col_name, '1Rw' value_val, 2 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test4' col_name, '1Tes' value_val, 2 position from dual),
            cnts as (select unique_id,
                            user_id,
                            user_seq,
                            col_name,
                            value_val,
                            position,
                            count(*) over (partition by unique_id) cnt
                     from   sample_data),
             res as (select distinct sd1.unique_id id1,
                                     sd2.unique_id id2,
                                     sd1.cnt,
                                     count(*) over (partition by sd1.unique_id, sd2.unique_id) total_id1_rows_cnt
                     from   cnts sd1
                            inner join cnts sd2 on sd1.unique_id < sd2.unique_id
                                                   and sd1.user_id = sd2.user_id
                                                   and sd1.user_seq = sd2.user_seq
                                                   and sd1.col_name = sd2.col_name
                                                   and sd1.value_val = sd2.value_val
                                                   and sd1.position = sd2.position
                                                   and sd1.cnt = sd2.cnt)
select id1||','||listagg(id2, ',') within group (order by id2) grouped_unique_ids
from   res
where  id1 not in (select id2
                   from   res)
and    cnt = total_id1_rows_cnt
group by id1
order by grouped_unique_ids;

And here's the db<>fiddle to prove it works

Should I index any specific columns to get good performance using this SQL? — dang, May 28 '19 at 09:28
I wouldn't think so, since you're querying over the entire table. — Boneist, May 28 '19 at 09:34
*"I wouldn't think so, since you're querying over the entire table."* the join would alteast (ideally) need to have multiple column index to prevent [nested loops join](https://docs.oracle.com/database/121/TGSQL/tgsql_join.htm#GUID-C3D5CEA4-0AF4-4E15-8167-6C5D065A95D3) acces plan? right? — Raymond Nijland, May 28 '19 at 09:43
*"I've worked with Oracle SQL since 1999, and have loved it ever since someone showed me how to join two tables together."* how can you forget the join you love them .. ok jokes aside.. — Raymond Nijland, May 28 '19 at 09:47
@RaymondNijland unlike computers, the memory size of my brain can’t be added too! — Boneist, May 28 '19 at 09:48
Also, the group by doesn't seem to produce correct output, I changed 1 value in unique_id = 5, it gave me result like this - https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=e5163daf970f2b4816c419e9cee510ed — dang, May 28 '19 at 10:00
@dang you're right; the group by doesn't work. Re. the index - you'd need to include all the columns in the join. — Boneist, May 28 '19 at 10:13
@dang the self-join version didn't work either, but I've managed to fix that so it does. — Boneist, May 28 '19 at 10:25
@Boneist - yeah, I was about to comment. Can you please share dbfiddle? — dang, May 28 '19 at 10:26
I saw the updated dbfiddle. Looks promising, but the query is fairly complicated. Is there no simpler way of doing it? — dang, May 28 '19 at 10:29

score 0 · Answer 2 · answered May 28 '19 at 08:59

0

If performance is not an issue, what about a self join?

select a.unique_id as unique_id
from table1 a join table1 b
on a.user_id = b.user_id
and a.user_seq = b.user_seq
and a.col_name = b.col_name
and a.value_val = b.value_val
and a.position = b.position
and a.unique_id <> b.unique_id

answered May 28 '19 at 08:59

CzarLazar

40
7

Ah, my table has more than 50 million records. Is there an optimized way of doing this? – dang May 28 '19 at 09:02
I see. Maybe this would help then.`select unique_id from ( select t.*, count(*) over (partition by [duplicate columns]) ct from table1 t) where ct > 1` – CzarLazar May 28 '19 at 09:19

score 0 · Answer 3 · answered May 28 '19 at 11:38

Assuming that you can concatenate the values into strings, perhaps the simplest method is:

select *
from (select unique_id, count(*) over (partition by vals) as cnt
      from (select unique_id,
                   listagg(user_id || ':' || user_seq || ':' || col_name || ':' || value_val || ':' || position, ',') within group (order by user_id, user_seq, col_name, value_val, position) as vals
            from sample_data sd
            group by unique_id
           ) sd
      ) sd
where cnt > 1;

Here is a db<>fiddle.

Let me emphasize: This is not a generic solution, because of internal string length limitations in Oracle. But it works for you data and may be a convenient solution for your problem.

You are right, this is not generic solution. I am getting error - ORA-01489: result of string concatenation is too long — dang, May 28 '19 at 12:03

Find duplicate for a combination of rows - Oracle SQL

3 Answers3