11

I need to calculate median value of a numeric sequence in Google BigQuery efficiently. Is the same possible?

Manish Agrawal
  • 127
  • 1
  • 1
  • 5
  • You have a low rate. Important on SO, you have to mark accepted answers by using the tick on the left of the posted answer, below the voting. This will increase your rate. See how this works by visinting this link: http://meta.stackoverflow.com/questions/5234/how-does-accepting-an-answer-work#5235 – Pentium10 Sep 30 '16 at 06:27
  • see also https://stackoverflow.com/questions/51981353/bigquery-sql-average-geometric-mean-remove-outliers-median – Felipe Hoffa Nov 13 '18 at 01:26

3 Answers3

16

Yeah it's possible with PERCENTILE_CONT window function.

Returns values that are based upon linear interpolation between the values of the group, after ordering them per the ORDER BY clause.

must be between 0 and 1.

This window function requires ORDER BY in the OVER clause.

So an example query would be like (the max() is there just to work across the group by but it's not being used as a math logic, should not confuse you)

SELECT room,
      max(median) FROM   (SELECT room,
         percentile_cont(0.5) OVER (PARTITION BY room
                                    ORDER BY temperature) AS median    FROM
    (SELECT 1 AS room,
            11 AS temperature),
    (SELECT 1 AS room,
            12 AS temperature),
    (SELECT 1 AS room,
            14 AS temperature),
    (SELECT 1 AS room,
            19 AS temperature),
    (SELECT 1 AS room,
            13 AS temperature),
    (SELECT 2 AS room,
            20 AS temperature),
    (SELECT 2 AS room,
            21 AS temperature),
    (SELECT 2 AS room,
            29 AS temperature),
    (SELECT 3 AS room,
            30 AS temperature)) GROUP BY room

This returns:

+------+-------------+
| room | temperature |
+------+-------------+
|    1 |          13 |
|    2 |          21 |
|    3 |          30 |
+------+-------------+
Pentium10
  • 190,605
  • 114
  • 394
  • 474
  • 2
    Can we have a little clearer and concise query please? I could not understand the above. – Manish Agrawal Mar 17 '15 at 11:32
  • @ManishAgrawal Try running in pieces and you will eventually understand, this query is simple. Probably what's new for you is the OVER() thing, which you need to read further, it's the base for Window Functions. In case the from clause confuses you, I tried to replicate a table results so that you can copy paste and run this query as is. – Pentium10 Mar 17 '15 at 12:21
  • 1
    the room 1 has the values 11,12,14,19,13 should not be the median 14? – Andres Urrego Angel Dec 21 '18 at 20:35
  • @AndresUrregoAngel the median value is derived from an ordered range of values. Sure, in the order you describe, above, the value '14' is in the middle. However, this is not an ordered list. The ordered list would be `11, 12, 13, 14, 19`. Therefore, '13' is the correct median value – Fab Dot Feb 25 '20 at 14:56
7

Alternative solution, when you don't need absolutely exact results and approximation is fine - you can use combination of NTH and QUANTILES aggregation functions. The advantage of this method is that it is much more scalable than analytic window functions, but the disadvantage is that it gives approximate results.

SELECT room,
       NTH(50, QUANTILES(temperature, 101)) FROM
    (SELECT 1 AS room,
            11 AS temperature),
    (SELECT 1 AS room,
            12 AS temperature),
    (SELECT 1 AS room,
            14 AS temperature),
    (SELECT 1 AS room,
            19 AS temperature),
    (SELECT 1 AS room,
            13 AS temperature),
    (SELECT 2 AS room,
            20 AS temperature),
    (SELECT 2 AS room,
            21 AS temperature),
    (SELECT 2 AS room,
            29 AS temperature),
    (SELECT 3 AS room,
            30 AS temperature) GROUP BY room

This returns

room temperature 
1    13  
2    21  
3    30
Mosha Pasumansky
  • 10,912
  • 2
  • 25
  • 48
  • I think you need `NTH(51, QUANTILES(temperature, 101))` for the median, since `NTH` is 1-based. See https://cloud.google.com/bigquery/query-reference#quantiles – Richard Poole Aug 04 '16 at 20:53
7

2018 update with more metrics:

BigQuery SQL: Average, geometric mean, remove outliers, median


For my own memory purposes, working queries with taxi data:

Approximate quantiles:

SELECT MONTH(pickup_datetime) month, NTH(51, QUANTILES(tip_amount,101)) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1

Gives the same results as PERCENTILE_DISC:

SELECT month, FIRST(median) median
FROM (
  SELECT MONTH(pickup_datetime) month, tip_amount, PERCENTILE_DISC(0.5) OVER(PARTITION BY month ORDER BY tip_amount) median
  FROM [nyc-tlc:green.trips_2015]
  WHERE tip_amount > 0
)
GROUP BY 1
ORDER BY 1

StandardSQL:

#StandardSQL
SELECT DATE_TRUNC(DATE(pickup_datetime), MONTH) month, APPROX_QUANTILES(tip_amount,1000)[OFFSET(500)] median
FROM `nyc-tlc.green.trips_2015`
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
Felipe Hoffa
  • 39,978
  • 8
  • 105
  • 241