Excluding results that appear in another column of a CONNECT BY query

Question

Have a heavy query (takes 15 minutes to run), but it's returning more results than I need. It's a CONNECT BY query, and I'm getting nodes that are descendants in the root node results. I.E.:

Ted
  Bob
    John
Bob
  John
John

Normally, the way to resolve this is using a START WITH condition, typically requiring the parent of a node to be null. But due to the nature of the query, I don't have the START WITH values I need to compare to until I have the full resultset. I'm basically trying to double-query my results to say QUERY STUFF START WITH RECORDS THAT AREN'T IN THAT STUFF.

Here's the query (built with the help of Nicholas Krasnov, here: Oracle Self-Join on multiple possible column matches - CONNECT BY?):

select cudroot.root_user, cudroot.node_level, cudroot.user_id, cudroot.new_user_id,
       cudbase.*  -- Not really, just simplyfing
from   css.user_desc cudbase
  join (select connect_by_root(user_id) root_user,   
               user_id                  user_id,        
               new_user_id              new_user_id,
               level                    node_level
        from   (select cudordered.user_id,      
                       coalesce(cudordered.new_user_id, cudordered.nextUser) new_user_id
                from   (select cud.user_id, 
                               cud.new_user_id, 
                               decode(cud.global_hr_id, null, null, lead(cud.user_id ignore nulls) over (partition by cud.global_hr_id order by cud.user_id)) nextUser
                        from   css.user_desc cud
                          left join gsu.stg_userdata gstgu
                            on (gstgu.user_id = cud.user_id 
                                or (gstgu.sap_asoc_global_id = cud.global_hr_id))
                        where  upper(cud.user_type_code) in ('EMPLOYEE','CONTRACTOR','DIV_EMPLOYEE','DIV_CONTRACTOR','DIV_MYTEAPPROVED')) cudordered)
        connect by nocycle user_id = prior new_user_id) cudroot
    on cudbase.user_id = cudroot.user_id
order by 
       cudroot.root_user, cudroot.node_level, cudroot.user_id;

This gives me results about related users (based off of user_id renames or associated SAP IDs) that look like this:

ROOT_ID     LEVEL   USER_ID         NEW_USER_ID
------------------------------------------------
A5093522    1       A5093522        FG096489
A5093522    2       FG096489        A5093665
A5093522    3       A5093665        
FG096489    1       FG096489        A5093665
FG096489    2       A5093665
A5093665    1       A5093665

What I need is a way to filter the first join (select connect_by_root(user_id)... to exclude FG096489 and A5093665 from the root list.

The best START WITH I can think of would look like this (not tested yet):

start with user_id not in (select new_user_id 
                           from   (select coalesce(cudordered.new_user_id, cudordered.nextUser) new_user_id
                                   from   (select cud.new_user_id, 
                                                  decode(cud.global_hr_id, null, null, lead(cud.user_id ignore nulls) over (partition by cud.global_hr_id order by cud.user_id)) nextUser
                                           from   css.user_desc cud
                                           where  upper(cud.user_type_code) in ('EMPLOYEE','CONTRACTOR','DIV_EMPLOYEE','DIV_CONTRACTOR','DIV_MYTEAPPROVED')) cudordered)
                           connect by nocycle user_id = prior new_user_id)

... but I'm effectively executing my 15 minute query twice.

I've looked at using partitions in the query, but there's not really a partition... I want to look at the full resultset of new_user_ids. Have also explored analytical functions such as rank()... my bag of tricks is empty.

Any ideas?

Clarification

The reason I don't want the extra records in the root list is because I only want one group of results for each user. I.E., if Bob Smith has had four accounts during his career here (people come and go frequently, as employees and/or contractors), I want to work with a set of accounts that all belong(ed) to Bob Smith.

If Bob came here as an contractor, converted to an employee, left, came back as a contractor in another country, and left/returned to a legal org that is now in our SAP system, his account rename/chain might look like:

Bob Smith  CONTRACTOR   ----    US0T0001  ->  US001101  (given a new ID as an employee)
Bob Smith  EMPLOYEE     ----    US001101  ->  EB0T0001  (contractor ID for  the UK)
Bob Smith  CONTRACTOR  SAP001   EB0T000T                (no rename performed)
Bob Smith  EMPLOYEE    SAP001   TE110001                (currently-active ID)

In the above example, the four accounts are linked by either a new_user_id field that was set when the user was renamed or through having the same SAP ID.

Because HR frequently fails to follow the business process, returning users may end up with any of those four ID being restored to them. I have to analyze all the IDs for Bob Smith and say "Bob Smith can only have TE110001 restored", and kick back an error if they try to restore something else. I have to do it for 90,000+ records.

The first column, "Bob Smith", is just an identifier to the group of associated accounts. In my original example, I'm using the root User ID as the identifier (e.g. US0T0001). If I use first/last names to identify users, I end up with collisions.

So Bob Smith would look like this:

US0T0001  1  CONTRACTOR   ----    US0T0001  ->  US001101  (given a new ID as an employee)
US0T0001  2  EMPLOYEE     ----    US001101  ->  EB0T0001  (contractor ID for  the UK)
US0T0001  3  CONTRACTOR  SAP001   EB0T0001                (no rename performed)
US0T0001  4  EMPLOYEE    SAP001   TE110001                (currently-active ID)

... where 1, 2, 3, 4 are the levels in the heirarchy.

Since US0T0001, US001101, EB0T0001, and TE110001 are all accounted for, I don't want another group for them. But the results I have now have those accounts listed in multiple groups:

US001101  1  EMPLOYEE     ----    US001101  ->  EB0T0001  (
US001101  2  CONTRACTOR  SAP001   EB0T0001                
US001101  3  EMPLOYEE    SAP001   TE110001               

EB0T0001  1  CONTRACTOR  SAP001   EB0T0001               
EB0T0001  2  EMPLOYEE    SAP001   TE110001                

US001101  1  EMPLOYEE    SAP001   TE110001

This causes two problems:

When I query the results for a User ID, I get hits from multiple groups
Each group will report a different expected user ID for Bob Smith.

You asked for an expanded set of records... here are some actual data:

-- NumRootUsers tells me how many accounts are associated with a user.
-- The new user ID field is explicitly set in the database, but may be null.
-- The calculated new user ID analyzes records to determine what the next related record is

          NumRoot                   New User    Calculated
RootUser  Users    Level  UserId    ID Field    New User ID   SapId       LastName        FirstName
-----------------------------------------------------------------------------------------------
BG100502  3        1      BG100502  BG1T0873    BG1T0873                  GRIENS VAN      KION
BG100502  3        2      BG1T0873  BG103443    BG103443                  GRIENS VAN      KION
BG100502  3        3      BG103443                            41008318    VAN GRIENS      KION

-- This group causes bad matches for Kion van Griens... the IDs are already accounted for,
-- and this group doesn't even grab all of the accounts for Kion.  It's also using a new 
-- ID to identify the group
BG1T0873  2        1      BG1T0873  BG103443    BG103443                  GRIENS VAN      KION
BG1T0873  2        2      BG103443                            41008318    VAN GRIENS      KION

-- Same here...
BG103443  1        1      BG103443                            41008318    VAN GRIENS      KION

-- Good group of records 
BG100506  3        1      BG100506              BG100778      41008640    MALEN VAN       LARS
BG100506  3        2      BG100778              BG1T0877      41008640    MALEN VAN       LARS
BG100506  3        3      BG1T0877                            41008640    VAN MALEN       LARS

-- Bad, unwanted group of records
BG100778  2        1      BG100778              BG1T0877      41008640    MALEN VAN       LARS
BG100778  2        2      BG1T0877                            41008640    VAN MALEN       LARS

-- Third group for Lars
BG1T0877  1        1      BG1T0877                            41008640    VAN MALEN       LARS


-- Jan... fields are set differently than the above examples, but the chain is calculated correctly
BG100525  3        1      BG100525              BG1T0894      41008651    ZANWIJK VAN     JAN
BG100525  3        2      BG1T0894  TE035165    TE035165      41008651    VAN ZANWIJK     JAN
BG100525  3        3      TE035165                            41008651    VAN ZANWIJK     JAN

-- Bad
BG1T0894  2        1      BG1T0894  TE035165    TE035165      41008651    VAN ZANWIJK     JAN
BG1T0894  2        2      TE035165                            41008651    VAN ZANWIJK     JAN

-- Bad bad
TE035165  1        1      TE035165                            41008651    VAN ZANWIJK     JAN


-- Somebody goofed and gave Ziano a second SAP ID... but we still matched correctly
BG100527  3        1      BG100527              BG1T0896      41008652    STEFANI DE      ZIANO
BG100527  3        2      BG1T0896  TE033030    TE033030      41008652    STEFANI DE      ZIANO
BG100527  3        3      TE033030                            42006172    DE STEFANI      ZIANO

-- And we still got extra, unwanted groups
BG1T0896  3        2      BG1T0896  TE033030    TE033030      41008652    STEFANI DE      ZIANO
BG1T0896  3        3      TE033030                            42006172    DE STEFANI      ZIANO

TE033030  3        3      TE033030                            42006172    DE STEFANI      ZIANO


-- Mark's a perfect example of the missing/frustrating data I'm dealing with... but we still matched correctly
BG102188  3        1      BG102188              BG1T0543      41008250    BULINS          MARK
BG102188  3        2      BG1T0543              TE908583      41008250    BULINS          R.J.M.A.
BG102188  3        3      TE908583                            41008250    BULINS          RICHARD JOHANNES MARTINUS ALPHISIUS

-- Not wanted
BG1T0543  3        2      BG1T0543              TE908583      41008250    BULINS          R.J.M.A.
BG1T0543  3        3      TE908583                            41008250    BULINS          RICHARD JOHANNES MARTINUS ALPHISIUS

TE908583  3        3      TE908583                            41008250    BULINS          RICHARD JOHANNES MARTINUS ALPHISIUS


-- One more for good measure
BG1T0146  3        1      BG1T0146  BG105905    BG105905                  LUIJENT         VALERIE
BG1T0146  3        2      BG105905              TE034165      42006121    LUIJENT         VALERIE
BG1T0146  3        3      TE034165                            42006121    LUIJENT         VALERIE

BG105905  3        2      BG105905              TE034165      42006121    LUIJENT         VALERIE
BG105905  3        3      TE034165                            42006121    LUIJENT         VALERIE

TE034165  3        3      TE034165                            42006121    LUIJENT         VALERIE

Not sure if all that info makes it clearer or will make your eyes roll back into your head : )

Thanks for looking at this!

its not very clear to me why FG096489 and A5093665 should be excluded from the root list - is it because they are root_users with new_user_id is null, or what ?could you give a fuller example of output from the query so we can see a few more combinations - with only 2 user ids all appearing somewher in all columns it is hard to see the pattern. — Hugh Jones, Oct 18 '12 at 16:38
The only reason for excluding them from the list is that I want them excluded : ) I'm open to other designs, but what I'm trying to do with the results requires there to only be one group per associated set of users. Will try to explain more in the question, thanks! — James King, Oct 19 '12 at 05:15
Are you able to drive the query by date? In other words 'STARTING WITH FirstEmployment' — Hugh Jones, Oct 19 '12 at 07:37
Nope : ( I'm dealing with data that is populated by hundreds of individual feeds of custom formats, a global feed in a standardized format that replace them by 2014, the ability to manually edit and add users through our intranet, and past support practices of updating the database directly to correct issues. Throw in country-specific HR rules that prevent certain fields from being populated, and I'm left with only being able to rely on the new_user_id (sometimes), and the SAP ID (most of the time). — James King, Oct 19 '12 at 17:26
That's a fair bit to go through, which is normally beyond the time SO users dedicate to a problem. If you post DDL and DML, perhaps more people will be inclined to take a look. — RichardTheKiwi, Oct 23 '12 at 19:49
Richard> Fair enough, I will try to simplify it and post DDL/DML. Essentially, though, what I'm asking is just `select user, manager from users where /*user isn't in the returned list of managers*/`... I don't care whether I solve it using a `start with` clause or a post-query filter... I just don't want to have to query the same data twice. — James King, Oct 23 '12 at 20:27
@James B - if you post some sample data to Sql Fiddle which covers all the cases you are trying to address I will have a look — Hugh Jones, Oct 25 '12 at 12:09

score 1 · Answer 1 · answered Oct 19 '12 at 08:27

1

I think I have it. We have allowed ourselves to become fixated on the chronological order whereas in fact it doesn't matter. Your START WITH clause should be 'NEW_USER_ID IS NULL'.

To get chronological order you could 'ORDER BY cudroot.node_level * -1'.

I would also recommend that you look at using a WITH clause to form your base data and perform the heirarchical query on that.

answered Oct 19 '12 at 08:27

Hugh Jones

2,584
14
28

Tried that... `new_user_id` is null will not always indicate a root node : ( In the examples I added, Lars van Malen has a null `new_user_id` for all three user IDs... they're connected by SAP ID instead. I can't say start where SAP ID is null, or both are null, or one is null and the other populated, because there are exceptions in every case. I have to find a way to say "no other node is a parent of this node". – James King Oct 19 '12 at 15:38
I haven't been able to make ORDER BY SIBLINGS work for some reason... I'm always told I can't use it, even though the query is a CONNECT BY. But I'm looking into possibilities there. JOC, when you write "ORDER BY cudroot.node_level * -1"... I've never seen an order by in that format or using an asterisk. What does that do to the sort order? – James King Oct 19 '12 at 17:30
'node level times minus one'. Are you able to generate a view of the data that yields the records in a form we can work with - In other words, with a field called Parent_ID? If so you can perform your heirarchical query on that. – Hugh Jones Oct 22 '12 at 10:45
I don't think I can - I think I would need this query to do that : P Kind of a which-comes-first issue. But whether I say `parent_id = prior user_id` or `user_id = prior new_user_id`, it's really the same thing. I've connected the records fine, I know the groups that are related... they even come back sorted correctly. I just need to exclude the children in the root list without querying the data twice. – James King Oct 22 '12 at 13:26
Here is a related question, on an approach I tried that doesn't work... it might make what I'm trying to do clearer. In fact, if I could make this question work, I would have the problem solved. http://stackoverflow.com/questions/12982568/not-in-query-odd-results – James King Oct 22 '12 at 13:28

score 1 · Answer 2 · answered Oct 30 '12 at 05:44

Perhaps what you need here is multiple queries. Each query will find a subset of the records you are trying to find. Each query will hopefully be simpler and faster than a single, ginormous query. Something like:

where new_user_id is null and SAP ID is null
where new_user_id is not null and SAP ID is null
where new_user_id is null and SAP ID is not null
where new_user_id is not null and SAP ID is not null

(these are of the cuff examples)

I think part of the problem with solving this conundrum is that the problem space is too large. By subdividing this problem into smaller pieces, each piece will be workable.

Excluding results that appear in another column of a CONNECT BY query

2 Answers2