how to use order by with collect_set() operation in hive

Question

In Table 1, I have customer_id, item_id and item_rank (rank of item according to some sales). I want to collect a list of items for each customer_id and arrange them according to item_rank.

Customer_id  item_id rank_item
  23            2      3
  23            2      3
  23            4      2
  25            5      1
  25            4      2

The output I expect is

Customer_id    item_list
  23             4,2
  25             5,4

The code I used was

 SELECT
    customer_id,
    concat_ws(',',collect_list (string(item_id))) AS item_list
FROM
    table1
GROUP BY
    customer_id
ORDER BY
    item_rank

Chris Nauroth · Answer 1 · 2017-07-17T21:00:23.490

You can use a sub-query to get a result set of (customer_id, item_id, item_rank), sorted by item_rank, and then use collect_set in the outer query.

Query

WITH table1 AS (
    SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
    SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
    SELECT 23 AS customer_id, 4 AS item_id, 2 AS item_rank UNION ALL
    SELECT 25 AS customer_id, 5 AS item_id, 1 AS item_rank UNION ALL
    SELECT 25 AS customer_id, 4 AS item_id, 2 AS item_rank
)
SELECT
    subquery.customer_id,
    collect_set(subquery.item_id) AS item_id_set
FROM (
    SELECT
        table1.customer_id,
        table1.item_id,
        table1.item_rank
    FROM table1
    DISTRIBUTE BY
        table1.customer_id
    SORT BY
        table1.customer_id,
        table1.item_rank
) subquery
GROUP BY
    subquery.customer_id
;

Results

    customer_id item_id_set
0   23  [4,2]
1   25  [5,4]

The sub-query uses DISTRIBUTE BY to guarantee that all rows for a particular customer_id route to the same reducer. It then uses SORT BY to sort by customer_id and item_rank within each reducer. I expect this is sufficient for the requirements, because I didn't notice a requirement for total ordering of the final result set. (If total ordering by customer_id is a requirement, then I think the query would have to use ORDER BY, which would cause slower execution.)

Internally, the collect_set UDAF uses a Java LinkedHashSet, which is an order-preserving collection, so the same sort order used in the sub-query will be maintained in the outer query's set. This is visible in the Hive codebase here:

https://github.com/apache/hive/blob/release-2.0.0/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java#L93

Thanks Chris, but won't there be a problem with a larger data set? From what I can see Sort by or even Cluster by does not ensure Global ordering https://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by — vkaul11, Jul 15 '17 at 21:04
@vkaul11 , great observation! I've updated the answer to use `DISTRIBUTE BY` and `SORT BY`. (I didn't notice a requirement for total ordering, so I didn't use `ORDER BY`.) — Chris Nauroth, Jul 17 '17 at 21:01
No, Chris total ordering is not a requirement so indeed it suffices. — vkaul11, Jul 21 '17 at 18:53

score 0 · Answer 2 · answered Mar 26 '19 at 14:24

0

SELECT customer_id, collect_set(item_id) AS item_list FROM table1 GROUP BY customer_id ORDER BY item_rank

NOTE : Using collect_list() gives you duplicates and collect_set() gives you unique values.

answered Mar 26 '19 at 14:24

Tutu Kumari

377
2
9

how to use order by with collect_set() operation in hive

2 Answers2

Linked