11

I'm having trouble formulating a query for the following problem:

For pair values that have a certain score, how do you group them in way that will only return distinct pair values with the best respective scores?

For example, lets say I have a table with the following row values:

(t1,p1,65)
(t1,p2,60)
(t1,p3,20)
(t2,p1,60)
(t2,p2,59)
(t2,p3,15)

The first two columns indicate the pair values and the third column represents the pair score.The best score is (t1,p1,65). Since t1 and p1 are now used, I want to exclude them from further analysis.

The next best score is (t2,p2,59). Even though (t1,p2) has a score of 60, I want to exclude it because "t1" has already been used. (t2,p1) also has a score of 60, but since p1 is also already used, this pair is excluded.

This results in the distinct pair score values of:

(t1,p1,65)
(t2,p2,59)

Is there any way to generate this result with just a query? I've tried to think of ways of grouping and partitioning the results, but since there has to be some accounting of values already used according to score rank, I'm finding this very difficult to approach.

EDIT:

To generate the data:

with t(t, p, score) as (
    (values ('t1','p1',65),
           ('t1','p2',60),
           ('t1','p3',20),
           ('t2','p1',60),
           ('t2','p2',59),
           ('t2','p3',15)
     ))
select t.* from t;
TheGameiswar
  • 25,396
  • 5
  • 48
  • 82
  • that's an interesting question :) – borowis Nov 01 '16 at 17:26
  • 2
    why do you have `(t2, p1, 60)` in result if you said we have to exclude `p1`? – borowis Nov 01 '16 at 17:30
  • 2
    You would need to solve this using recursive CTEs. – Gordon Linoff Nov 01 '16 at 17:43
  • Your example excludes `t1` but not `p1`. Is that right? If so then you may need to clarify with a few more examples. – Philip Couling Nov 01 '16 at 17:56
  • This was my mistake. The correct result should be: (t1,p1,65) (t2,p2,59) – Stephen Tableau Nov 01 '16 at 18:06
  • 1
    After playing around with this, it seems harder than I thought. Postgres (and most databases) have certain limitations on recursive CTEs that preclude the natural way to express this. There is something about this problem that makes me suspect that it cannot be handled with recursive CTEs. It has something to do with the fact the there are two distinct steps: finding the maximum score remaining and then removing nodes from consideration. – Gordon Linoff Nov 01 '16 at 18:43
  • Yea I spent a good couple hours on it. I resorted to just doing all the calculations in Python then porting the results back into my DB, but it would be so much more efficient to come up with a query. – Stephen Tableau Nov 01 '16 at 18:50

4 Answers4

4

This problem has obviously been bothering me. The following appears to implement your logic, keeping arrays of visited values in rows:

with recursive t(t, p, score) as (
    (values ('t1','p1',65),
           ('t1','p2',60),
           ('t1','p3',20),
           ('t2','p1',60),
           ('t2','p2',59),
           ('t2','p3',15)
     )),
     cte(t, p, score, cnt, lastt, lastp, ts, ps) as (
        (select t.*, count(*) over ()::int, tt.t, tt.p, ARRAY[tt.t], ARRAY[tt.p]
         from t cross join
              (select t.* from t order by score desc limit 1) tt
        ) 
        union all
        select t, p, score, 
               sum(case when not (ts @> ARRAY[t] or ps @> ARRAY[p]) then 1 else 0 end) over ()::int,
               first_value(t) over (order by case when not (ts @> ARRAY[t] or ps @> ARRAY[p]) then score end desc nulls last),
               first_value(p) over (order by case when not (ts @> ARRAY[t] or ps @> ARRAY[p]) then score end desc nulls last),
               ts || first_value(t) over (order by case when not (ts @> ARRAY[t] or ps @> ARRAY[p]) then score end desc nulls last),
               ps || first_value(p) over (order by case when not (ts @> ARRAY[t] or ps @> ARRAY[p]) then score end desc nulls last)
        from cte 
        where cnt > 0
       )
 select *
 from cte
 where lastt = t and lastp = p and cnt > 0;
Gordon Linoff
  • 1,122,135
  • 50
  • 484
  • 624
3

It is relatively simple using the stored function:

--drop function if exists f();
--drop table if exists t;
create table t(x text,y text, z int);
insert into t values
  ('t1','p1',65),
  ('t1','p2',60),
  ('t1','p3',20),
  ('t2','p1',60),
  ('t2','p2',59),
  ('t2','p3',15)/*,
  ('t3','p1',20),
  ('t3','p2',60),
  ('t3','p3',40)*/;

create function f() returns setof t immutable language plpgsql as $$
declare
  ax text[];
  ay text[];
  r t;
begin
  ax := '{}'; ay := '{}';
  loop
    select * into r
      from t
      where x <> all(ax) and y <> all(ay)
      order by z desc, x, y limit 1;
    exit when not found;
    ax := ax || r.x; ay := ay || r.y;
    return next r;
  end loop;
end $$;

select * from f();
╔════╤════╤════╗
║ x  │ y  │ z  ║
╠════╪════╪════╣
║ t1 │ p1 │ 65 ║
║ t2 │ p2 │ 59 ║
╚════╧════╧════╝

However if uncomment the third bunch of the values the result will be different:

╔════╤════╤════╗
║ x  │ y  │ z  ║
╠════╪════╪════╣
║ t1 │ p1 │ 65 ║
║ t3 │ p2 │ 60 ║
║ t2 │ p3 │ 15 ║
╚════╧════╧════╝

Upd: and the equivalent using recursive CTE on the same test data:

with recursive r as (
  (select x, y, z, array[x] as ax, array[y] as ay from t order by z desc, x, y limit 1)
  union all
  (select t.x, t.y, t.z, r.ax || t.x, r.ay || t.y from t, r 
  where not (t.x = any(r.ax) or t.y = any(r.ay)) 
  order by t.z desc, t.x, t.y limit 1))
select * from r;
Abelisto
  • 12,110
  • 2
  • 24
  • 35
  • Beautiful. Thank you! – Stephen Tableau Nov 02 '16 at 06:09
  • I actually tried this out on a dataset with a couple hundred thousand rows, but it was very slow because of the recursion. Any ideas on how to speed up the efficiency here? – Stephen Tableau Nov 02 '16 at 13:38
  • @StephenTableau It is hard to say anything without the knowledge about the actual data structure, selectivity, indexes and so on. Try to create the indexes which covered `where` and `order by` parts (for my example it would be `...on t(x,y)` and `...on t(z desc)`). Is the [Gordon Linoff's solution](http://stackoverflow.com/a/40367066/593144) faster? Probably it was the more efficient solution. – Abelisto Nov 02 '16 at 17:11
2

t1 was used so you've excluded (t1,p2) but p1 was also used and you haven't excluded it. For me it looks like just groupping by first column.

select t1.c1, t2.c2, t1.s 
  from table1 t2 
  inner join (select c1, max(score) s from table1 group by t1) t1 
    on (t1.s=t2.score and t1.c1=t2.c1);

Where table1 is name for your table and c1 is first, c2 second and score third column;

Kacper
  • 4,773
  • 2
  • 15
  • 31
1

If first pair value, and second pair value are different columns (say, X and Y) you could group by X and do MAX(score) as an aggregation function to get the max score for tuples starting with X.

Further steps depend on your data, because you might still get undesired duplicates if every tuple is reversed. So to exclude such reversed tuples you might do a self-join first.

borowis
  • 1,171
  • 9
  • 16