0

I have a table:

table1

unique_id       user_id   user_seq      col_name            value_val    position
1               100       1             test1               100          1
1               100       1             test2               123          1
1               100       1             test1               a            2
1               100       1             test2               text         2
1               100       1             test3               1Rw          2
1               100       1             test4               1Tes         2

2               101       1             test1               1            1
2               101       1             test2               1            1
2               101       1             test3               1            1
2               101       1             test4               1            1
2               101       1             test5               1            1

3               100       1             test1               100          1
3               100       1             test2               123          1
3               100       1             test1               a            2
3               100       1             test2               text         2
3               100       1             test3               1Rw          2
3               100       1             test4               1Tes         2

4               101       1             test1               1            1
4               101       1             test2               1            1
4               101       1             test3               1            1
4               101       1             test4               1            1

I need to find duplicate based on the following:

user_id, user_seq, col_name, value_val and position should be exactly same for different unique_id.

In the above example, unique_id - 1 and 3 are EXACTLY same, so they should be returned as output.

For unique_id = 2 and 4, there is a difference of test5 not available for unique_id = 4, so it won't be captured.

Output would be:

unique_id
1
3

Also, my dataset is huge, around 50 million records, so need an optimized solution. Any help?

EDIT

My table structure:

Name        Null? Type           
----------- ----- --------------         
UNIQUE_ID          NUMBER         
USER_SEQ           VARCHAR2(100)          
COL_NAME           VARCHAR2(263)  
VALUE_VAL          VARCHAR2(4000) 
POSITION           NUMBER             
USER_ID            NUMBER  

No indexes available.

dang
  • 2,030
  • 3
  • 33
  • 61
  • GROUP BY, HAVING, COUNT(DISTINCT ) – jarlh May 28 '19 at 08:25
  • I also thought of the same, however that will not work for my case. For ex. user_id = 101 and user_seq = 1 will give me count of 2 for col_name = test1 whereas it should check for a combination of rows. – dang May 28 '19 at 08:33
  • *"Also, my dataset is huge, around 50 million records, so need an optimized solution. Any help?"* Also share the `CREATE TABLE` structures so we know the datatypes and indexes.. Also 50 million records isn't "huge" i had to say it.. – Raymond Nijland May 28 '19 at 09:30
  • @RaymondNijland - I have shared output of desc table1; in the question. Right now, there are no indexes. – dang May 28 '19 at 09:54
  • Also which Oracle database version are you using? – Raymond Nijland May 28 '19 at 10:20
  • @RaymondNijland - I am using Oracle Enterprise 18c. – dang May 28 '19 at 10:24
  • i need to eat now but i would do something like using `STANDARD_HASH()` in combination with LISTAGG to generate a checksum over a group off records see [example](https://dbfiddle.uk/?rdbms=oracle_18&fiddle=6f15a2785adfad8fa617ab64feda8f27) load it into a temporary table (with with indexes) and you can simply use `GROUP BY, HAVING, COUNT(DISTINCT ) ` syntax on the temporary table.. – Raymond Nijland May 28 '19 at 10:56
  • @RaymondNijland - it is giving me error - ORA-01489: result of string concatenation is too long – dang May 28 '19 at 11:02
  • *" it is giving me error - ORA-01489: result of string concatenation is too long "* i didn't do Oracle database for a while now so iam rusty but it would make sense on real dataset and it is a `VARCHAR2 ` and if you use `STANDARD_HASH(LISTAGG(TO_CLOB(hash))) ` instead? – Raymond Nijland May 28 '19 at 11:22
  • It works for the first time, but if I scroll down, it gives me error - ORA-01489: result of string concatenation is too long – dang May 28 '19 at 11:24
  • ok try `ALTER SYSTEM SET max_string_size=extended SCOPE= SPFILE;` `LISTAGG` seams to be limited by `max_string_size` setting... If that does not work count me out as iam then out off options as i didn't do Oracle database for a while now and iam rusty – Raymond Nijland May 28 '19 at 11:28

3 Answers3

1

Here's one way of doing it:

with sample_data as (select 1 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, '100' value_val, 1 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, '123' value_val, 1 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, 'a' value_val, 2 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, 'text' value_val, 2 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test3' col_name, '1Rw' value_val, 2 position from dual union all
                     select 1 unique_id, 100 user_id, 1 user_seq, 'test4' col_name, '1Tes' value_val, 2 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 2 unique_id, 101 user_id, 1 user_seq, 'test5' col_name, '1' value_val, 1 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, '100' value_val, 1 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, '123' value_val, 1 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, 'a' value_val, 2 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, 'text' value_val, 2 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test3' col_name, '1Rw' value_val, 2 position from dual union all
                     select 3 unique_id, 100 user_id, 1 user_seq, 'test4' col_name, '1Tes' value_val, 2 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 4 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 6 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test1' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test2' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test3' col_name, '1' value_val, 1 position from dual union all
                     select 7 unique_id, 101 user_id, 1 user_seq, 'test4' col_name, '1' value_val, 1 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, '100' value_val, 1 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, '123' value_val, 1 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test1' col_name, 'a' value_val, 2 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test2' col_name, 'text' value_val, 2 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test3' col_name, '1Rw' value_val, 2 position from dual union all
                     select 5 unique_id, 100 user_id, 1 user_seq, 'test4' col_name, '1Tes' value_val, 2 position from dual),
            cnts as (select unique_id,
                            user_id,
                            user_seq,
                            col_name,
                            value_val,
                            position,
                            count(*) over (partition by unique_id) cnt
                     from   sample_data),
             res as (select distinct sd1.unique_id id1,
                                     sd2.unique_id id2,
                                     sd1.cnt,
                                     count(*) over (partition by sd1.unique_id, sd2.unique_id) total_id1_rows_cnt
                     from   cnts sd1
                            inner join cnts sd2 on sd1.unique_id < sd2.unique_id
                                                   and sd1.user_id = sd2.user_id
                                                   and sd1.user_seq = sd2.user_seq
                                                   and sd1.col_name = sd2.col_name
                                                   and sd1.value_val = sd2.value_val
                                                   and sd1.position = sd2.position
                                                   and sd1.cnt = sd2.cnt)
select id1||','||listagg(id2, ',') within group (order by id2) grouped_unique_ids
from   res
where  id1 not in (select id2
                   from   res)
and    cnt = total_id1_rows_cnt
group by id1
order by grouped_unique_ids;

And here's the db<>fiddle to prove it works

Boneist
  • 21,504
  • 1
  • 23
  • 39
  • Should I index any specific columns to get good performance using this SQL? – dang May 28 '19 at 09:28
  • I wouldn't think so, since you're querying over the entire table. – Boneist May 28 '19 at 09:34
  • *"I wouldn't think so, since you're querying over the entire table."* the join would alteast (ideally) need to have multiple column index to prevent [nested loops join](https://docs.oracle.com/database/121/TGSQL/tgsql_join.htm#GUID-C3D5CEA4-0AF4-4E15-8167-6C5D065A95D3) acces plan? right? – Raymond Nijland May 28 '19 at 09:43
  • *"I've worked with Oracle SQL since 1999, and have loved it ever since someone showed me how to join two tables together."* how can you forget the join you love them .. ok jokes aside.. – Raymond Nijland May 28 '19 at 09:47
  • 1
    @RaymondNijland unlike computers, the memory size of my brain can’t be added too! – Boneist May 28 '19 at 09:48
  • Also, the group by doesn't seem to produce correct output, I changed 1 value in unique_id = 5, it gave me result like this - https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=e5163daf970f2b4816c419e9cee510ed – dang May 28 '19 at 10:00
  • @dang you're right; the group by doesn't work. Re. the index - you'd need to include all the columns in the join. – Boneist May 28 '19 at 10:13
  • @dang the self-join version didn't work either, but I've managed to fix that so it does. – Boneist May 28 '19 at 10:25
  • @Boneist - yeah, I was about to comment. Can you please share dbfiddle? – dang May 28 '19 at 10:26
  • I saw the updated dbfiddle. Looks promising, but the query is fairly complicated. Is there no simpler way of doing it? – dang May 28 '19 at 10:29
0

If performance is not an issue, what about a self join?

select a.unique_id as unique_id
from table1 a join table1 b
on a.user_id = b.user_id
and a.user_seq = b.user_seq
and a.col_name = b.col_name
and a.value_val = b.value_val
and a.position = b.position
and a.unique_id <> b.unique_id
CzarLazar
  • 40
  • 7
  • Ah, my table has more than 50 million records. Is there an optimized way of doing this? – dang May 28 '19 at 09:02
  • I see. Maybe this would help then.`select unique_id from ( select t.*, count(*) over (partition by [duplicate columns]) ct from table1 t) where ct > 1` – CzarLazar May 28 '19 at 09:19
0

Assuming that you can concatenate the values into strings, perhaps the simplest method is:

select *
from (select unique_id, count(*) over (partition by vals) as cnt
      from (select unique_id,
                   listagg(user_id || ':' || user_seq || ':' || col_name || ':' || value_val || ':' || position, ',') within group (order by user_id, user_seq, col_name, value_val, position) as vals
            from sample_data sd
            group by unique_id
           ) sd
      ) sd
where cnt > 1;

Here is a db<>fiddle.

Let me emphasize: This is not a generic solution, because of internal string length limitations in Oracle. But it works for you data and may be a convenient solution for your problem.

Gordon Linoff
  • 1,122,135
  • 50
  • 484
  • 624