0

I have a table with the structure given below: Tabel Structure

A User_ID has values for its respective items in the specific time interval. Item value can be text or integer depends upon the item.

I want to check if any Two or more UserId as same values, meaning their items are same with same values and in the same time interval.

As in above table UserId 213456 and UserId 213458 has same records. I tried using cursor and loops, but it's taking too long. My table has more than 50 million UserId. Is there a way to do this in an efficient way?

I also tried using group by with subqueries but all the attempts were failed to create a good query for it.

I have created the following query using How do I find duplicate values in a table in Oracle?

select t1.USERID, count(t1.USERID) 
from USERS_ITEM_VAL t1 
where exists ( select * 
               from  USERS_ITEM_VAL t2 
                where t1.rowid <> t2.rowid and 
                       t2.ITEMID = t1.ITEMID and 
                       t2.TEXT_VALUE = t1.TEXT_VALUE and 
                       --t2.INTEGER_VALUE = t1.INTEGER_VALUE and 
                       t2.INIT_DATE = t1.INIT_DATE and 
                      t2.FINAL_DATE = t1.FINAL_DATE )  
        group by t1.USERID having count(t1.USERID) > 1 order by count(t1.USERID);

But the problem is its working when excluding the INTEGER_VALUE columns but not giving me output when I include INTEGER_VALUE column in the join, though my data in INTEGER_VALUE column is same. Here is the structure of my table:

USERID - NUMBER
ITEMID - NUMBER
TEXT_VALUE - VARCHAR2(500)
INTEGER_VALUE - NUMBER
INIT_DATE - DATE
FINAL_DATE - DATE 
halfer
  • 18,701
  • 13
  • 79
  • 158
John
  • 17
  • 5

2 Answers2

0

One way to approach this uses a self join. The idea is to count the number of items that two users have in common (taking the date columns into account). Then compare this to the number of items that each has:

with t as (
      select t.*, count(*) over (partition by userid) as numitems
      from t
     )
select t1.userid, t2.userid
from t t1 join
     t t2
     on t1.userid < t2.userid and
        t1.itemid = t2.itemid and
        t1.init_date = t2.init_date and
        t1.final_date = t2.final_date and
        t1.numitems = t2.numitems
group by t1.userid, t2.userid, t1.numitems
having count(*) = t1.numitems;
Gordon Linoff
  • 1,122,135
  • 50
  • 484
  • 624
0

The reason your query failed is that either text_value or integer_value will be NULL in every row. For this reason, it's not possible to use an equality predicate in the self-join without using NVL functions to plug the NULL values.

However, below is a query that uses an analytic function to accomplish the goal:

Select * From (
Select t.*, Count(*) Over (Partition By t.itemId, 
                                        t.text_value, 
                                        t.integer_value, 
                                        t.init_date, 
                                        t.final_date) as Cnt)
Where cnt > 1;

The query returns all rows where multiple records have identical values in the five columns of the Partition By clause.

A benefit of this technique over the self-join approach is that the table is scanned only once, whereas it would be scanned twice with a self join. This could result in better performance if the table is large.

Kirby
  • 684
  • 4
  • 7