2

In the data set, every shop is selling some books and every shop has its own price for each book. In the data, I have the price information for each book. With the query in Amazon Athena, I want to calculate the median price for each shop and each product in a specific time period.

But honestly, I have no idea how to do it. Here is my query so far:

SELECT product_id,
       shop_id,
       XXX AS median_price
FROM data_f
    WHERE site_id = 10
            AND year || month || day || hour >= '2020022500'
            AND year || month || day || hour < '2020022600'
GROUP BY product_id, shop_id

Thanks!

datazang
  • 655
  • 1
  • 9
  • Start here: https://stackoverflow.com/search?q=calculate+median+price+ – Luuk Feb 28 '20 at 10:57
  • Median is traditionally a difficult value to calculate in SQL. Unless the database directly supports median (and Amazon Athena does not), then you'll need to write some code to calculate it. You might be able to use: [Querying with User Defined Functions (Preview) - Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/querying-udf.html) – John Rotenstein Feb 28 '20 at 11:07

4 Answers4

3

Unfortunately, AWS doesn't support a median() aggregation function or the percentile() functions. Perhaps the simplest method is to use ntile(2) in a subquery and then take the maximum of the first tile (or the minimum of the second tile:

SELECT product_id, shop_id,
       MAX(CASE WHEN tile2 = 1 THEN price END) as median
FROM (SELECT d.*, NTILE(2) OVER (PARTITION BY product_id, shop_id ORDER BY price) as tile2
      FROM data_f d
      WHERE site_id = 10 AND
            action NOT IN ('base', 'delete') AND
            year || month || day || hour >= '2020022500' AND
            year || month || day || hour < '2020022600'
     ) d
GROUP BY product_id, shop_id;

Note: This is undoubtedly good enough for any practical purpose. However, "median" is usually defined as the average of the two middle values when the total number of rows is even. If you want to be pedantic:

SELECT product_id, shop_id,
       (CASE WHEN COUNT(*) % 2 = 0
             THEN (MAX(CASE WHEN tile2 = 1 THEN price END) +
                   MIN(CASE WHEN tile2 = 2 THEN price END)
                  ) / 2.0
             ELSE MAX(CASE WHEN tile2 = 1 THEN price END)
        END) as median
Gordon Linoff
  • 1,122,135
  • 50
  • 484
  • 624
2

The median value is the one in the middle when all are listed in order, so let's create that order with a dense_rank()

with q1 as
(
SELECT product_id,
       shop_id,
       price, 
       dense_rank() over (partition by product_id, shop_id order by price) as price_rank
FROM data_f
    WHERE site_id = 10
            AND action <> 'base'
            AND action <> 'delete'
            AND year || month || day || hour >= '2020022500'
            AND year || month || day || hour < '2020022600'
)
, q2 as 
(
select max(price_rank) as mp
from q1
)
select q1.*
from q1
where q1.price_rank = (select floor(mp/2) from q2)

Documentation of window functions is part of the Presto Functions documentation here

JohnHC
  • 10,417
  • 1
  • 19
  • 36
1

You can use approx_percentile

    select approx_percentile(column_name, 0.5) from table

solution from Philipp Johannis Calculate Median for each group in AWS Athena table

    SELECT product_id,
           shop_id,
           approx_percentile(price, 0.5) AS median_price
    FROM data_f
        WHERE site_id = 10
                AND year || month || day || hour >= '2020022500'
                AND year || month || day || hour < '2020022600'
    GROUP BY product_id, shop_id
aembick
  • 11
  • 3
0

Below query for calculating median:

with res1 as
(select id,ROW_NUMBER() over (order by id) "median_row_num" from test ),
res2 as
(select count(median_row_num) as i  from res1)
select id as "median" from res1 where res1.median_row_num = (select  case when i%2 = 0 then i/2 else i/2+1 end from res2)

Note : Remember median is middle element in sorted list of numbers.

if a = [3,4,2,6,7]

sorted list a = [2,3,4,6,7]

count of elements is 5 so median would be 4.

But in case of if a = [2,3,4,6,7,8]

Count of elements 6 which is even number so there are two mid elements 4 and 6

So median would be 5 (4+6 = 10/2 = 5)

So above query is good for odd counts and incase of even counts it will always give you first half element.

Pankaj_Dwivedi
  • 555
  • 3
  • 12