Efficiently take min of a column with other column as the key

Question

I recently transitioned over from using Presto to Hive. I have the following scenario. Column A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In presto you can something like this as SELECT A, min_by(B, C) from <TABLE> GROUP BY A Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following

SELECT A, COALESCE(B, 0)
from <TABLE> as primary
JOIN (
SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A 
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C\
GROUP BY A

I have 2 problems with this solution

It's not concise at all.
It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.

Is there a way to achieve what I am trying to do without writing your custom UDF ?

why not just sort your data and take the head instead of selecting for the min? https://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by — kpie, Nov 09 '18 at 00:08
@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one. — gibraltar, Nov 09 '18 at 00:13
can you post sample input so that we can work on a solution? — stack0114106, Nov 09 '18 at 04:54
Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics so look for tutorials about `MIN(x) OVER (PARTITION BY a,b) AS xx` or sthg like that — Samson Scharfrichter, Nov 09 '18 at 08:00

leftjoin · Accepted Answer · 2018-11-09T13:21:48.357

2

Join works slower than analytic functions, try this approach without join, and table will be scanned only once:

select s.*
from
(
SELECT A, COALESCE(B, 0) as B, C
      min(C) over (partition by A) as min_C
from <TABLE> as primary
)s
where s.C=s.min_C;

If you need min(C) to be calculated by more group columns, add them to the partition BY clause.

edited Nov 09 '18 at 13:21

answered Nov 09 '18 at 13:16

leftjoin

28,302
6
46
84

score 0 · Answer 2 · answered Nov 03 '20 at 06:03

0

You can try TD_first(B,C) in hive. Works in the same fashion.

answered Nov 03 '20 at 06:03

user14568460

1

Efficiently take min of a column with other column as the key

2 Answers2