
I am pulling my hair out trying to create a running / cumulative median of a partitioned value, in chronological order. Basically I have a table:

create table "SomeData"
(
    ClientId INT,
    SomeData DECIMAL(10,2),
    SomeDate TIMESTAMP
);

With some data:

INSERT INTO "SomeData" (ClientId, SomeData, SomeDate) VALUES
(1, 1, '1 Jan 2000'),
(1, 2, '2 Jan 2000'),
(1, 3, '3 Jan 2000'),
(1, 4, '4 Jan 2000'),
(2, 100, '1 Jan 2000'),
(2, 100, '2 Jan 2000'),
(2, 100, '3 Jan 2000'),
(2, 200, '4 Jan 2000'),
(2, 200, '5 Jan 2000'),
(2, 200, '6 Jan 2000'),
(2, 200, '7 Jan 2000');

And I need a running median partitioned by ClientId, ordered by SomeDate.

Basically, what I need to produce is this:

ClientId    SomeDate      Median of SomeData
1           "2000-01-01"  1.000
1           "2000-01-02"  1.500
1           "2000-01-03"  2.000
1           "2000-01-04"  2.500
2           "2000-01-01"  100.0
2           "2000-01-02"  100.0
2           "2000-01-03"  100.0
2           "2000-01-04"  100.0
2           "2000-01-05"  100.0
2           "2000-01-06"  150.0
2           "2000-01-07"  200.0

I am able to do this in a number of ways in PostgreSQL 9.x with the Aggregate_median function; however, this is proving difficult in Redshift, which only has an aggregate median:

SELECT ClientId, SomeDate, median(SomeData) OVER (PARTITION BY ClientId ORDER BY SomeDate)
FROM "SomeData" xout
ORDER BY ClientId, SomeDate;

Running the above on Redshift however gives the error:

ERROR: window specification should not contain frame clause and order-by for window function median

The median can be replaced with a manual correlated subquery back to the original table; however, Redshift doesn't appear to support these either.

ERROR: This type of correlated subquery pattern is not supported due to internal error

Here are a bunch of fiddles which work in Postgres, none of which work in Redshift.

At this point it looks like I'll need to pull the data into memory and do this in code, but would be appreciative if this can be done in Redshift directly.
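For completeness, the in-memory fallback is simple. A minimal Python sketch (hypothetical code, assuming the rows have already been fetched out of Redshift) that reproduces the expected output above, interpolating the two middle values for even-length windows:

```python
from itertools import groupby

def running_medians(rows):
    """Given (client_id, value, date) rows, yield (client_id, date, median),
    where the median is over all of that client's values up to that date."""
    rows = sorted(rows, key=lambda r: (r[0], r[2]))
    for client_id, group in groupby(rows, key=lambda r: r[0]):
        seen = []
        for _, value, date in group:
            seen.append(value)
            s = sorted(seen)
            n = len(s)
            # odd-length window: the middle value; even-length: interpolate
            median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
            yield client_id, date, median

data = [
    (1, 1, '2000-01-01'), (1, 2, '2000-01-02'),
    (1, 3, '2000-01-03'), (1, 4, '2000-01-04'),
    (2, 100, '2000-01-01'), (2, 100, '2000-01-02'), (2, 100, '2000-01-03'),
    (2, 200, '2000-01-04'), (2, 200, '2000-01-05'),
    (2, 200, '2000-01-06'), (2, 200, '2000-01-07'),
]
for row in running_medians(data):
    print(row)
```

This matches the target table above, e.g. `(1, '2000-01-04', 2.5)` and `(2, '2000-01-06', 150.0)`.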

StuartLC

3 Answers


I wonder if you can do this with nth_value():

SELECT ClientId, SomeDate,
       NTH_VALUE(SomeData, seqnum / 2) OVER (PARTITION BY ClientId ORDER BY SomeDate)
FROM (SELECT s.*,
             COUNT(*) OVER (PARTITION BY ClientId ORDER BY SomeDate) as seqnum
      FROM "SomeData" s
     ) s
ORDER BY ClientId, SomeDate;

As a note: that use of COUNT(*) instead of ROW_NUMBER() takes some getting used to.
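A quick sketch of why that idiom works, in plain Python (an illustration only, not anything Redshift-specific): with an ORDER BY in the window, COUNT(*) counts every row from the partition start through the current row, peers included, so it matches ROW_NUMBER() exactly when the sort key is unique within the partition.

```python
def running_counts(sort_keys):
    """Emulate COUNT(*) OVER (ORDER BY k): for each row, count rows whose
    key sorts at or before it (the default RANGE frame includes peers)."""
    return [sum(1 for other in sort_keys if other <= k) for k in sort_keys]

# With unique sort keys the result is identical to ROW_NUMBER():
print(running_counts(['2000-01-01', '2000-01-02', '2000-01-03']))  # [1, 2, 3]
# With ties it counts peers together, unlike ROW_NUMBER():
print(running_counts(['2000-01-01', '2000-01-01', '2000-01-02']))  # [2, 2, 3]
```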

Gordon Linoff
  • Thanks Gordon - I needed a couple of tweaks but was able to get [this working here](http://sqlfiddle.com/#!15/0edb0/3). Obviously this is a slight approximation for even counts of data, viz the 2 middlemost points should be interpolated, but this will surely suffice for 99% of applications. – StuartLC Jun 18 '15 at 12:20
  • @StuartLC . . . It is easy enough to modify for even counts. It just makes the query itself a lot more complicated. It's cool that this works in Red Shift. Makes me wonder why they implemented this functionality but not `median()` with a window clause. – Gordon Linoff Jun 18 '15 at 13:00
  • @GordonLinoff actually, I would love to see the modification for even counts, since Redshift won't evaluate nth_value( seqnum/2::INT ) when seqnum = 1, even with a case statement preventing the case seqnum = 1. – Korbonits Apr 15 '16 at 01:09
  • @Korbonits . . . Could you ask another question? This requires a bit of thought. – Gordon Linoff Apr 15 '16 at 01:20
  • @GordonLinoff I actually came up with a solution in the mean time but it's not sexy... LMK best way to share. – Korbonits Apr 15 '16 at 01:38
  • 1
    @StuartLC see below for an implementation that is not an approximation :) http://stackoverflow.com/a/36636914/3320944 – Korbonits Apr 15 '16 at 01:49

I think the solution presented by @GordonLinoff is not correct, because it does not order the rows by the value you are trying to find the median of. The correct way, inspired by:

Moving Median, Mode in T-SQL

works on Redshift:

WITH CTE AS
(
    SELECT ClientId,
           ROW_NUMBER() OVER (PARTITION BY ClientId ORDER BY SomeDate ASC) AS row_num,
           SomeDate,
           SomeData
    FROM "SomeData"
)
SELECT A.ClientId,
       A.SomeDate,
       A.SomeData,
       (SELECT MEDIAN(B.SomeData)
        FROM CTE B
        WHERE B.ClientId = A.ClientId      -- stay within the partition
          AND B.row_num BETWEEN 1 AND A.row_num) AS median
FROM CTE A
ORDER BY A.ClientId, A.SomeDate;
Bahman

This is an exact computation of the quantity you are looking for.

Not sexy per se, but it properly handles medians for odd- vs. even-length windows.

with row_numbers as (
    SELECT d.partitionField -- the field (or fields) you are partitioning the window function by
         , d.orderField  -- your sort field for the window functions
         , d.medianField -- quantity you are computing the median of

         , ROW_NUMBER() 
           OVER (PARTITION BY partitionField ORDER BY orderField) as seqnum

    FROM data d
)

, medians as (    
    SELECT nth_value(medianField, CASE 
                                  WHEN mod(seqnum, 2) = 0 THEN (seqnum/2)::int 
                                  ELSE ((seqnum/2)::int + 1) 
                                  END) 
           OVER (PARTITION BY partitionField ORDER BY orderField ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as median1

         , nth_value(medianField, (seqnum/2)::int + 1) OVER (PARTITION BY partitionField ORDER BY orderField ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as median2

         , mod(seqnum, 2) as mod1
    FROM row_numbers
    ORDER BY partitionField, orderField
)

select CASE
       when mod1 = 0  -- even-length window: average the two middle values
       then ((median1 + median2)/2)::FLOAT
       else median1   -- odd-length window: the single middle value
       end as median
from medians
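The index arithmetic in the CASE is the crux. A small Python restatement (an illustration only, not part of the query) of which 1-based positions the two nth_value calls read for a running window of seqnum rows:

```python
def picked_positions(seqnum):
    """Mirror the CASE above: (median1 position, median2 position) for a
    running window of seqnum rows, as 1-based nth_value offsets."""
    if seqnum % 2 == 0:
        # even: the final SELECT averages these two middle positions
        return seqnum // 2, seqnum // 2 + 1
    # odd: only median1 is used; it points at the single middle row
    return seqnum // 2 + 1, seqnum // 2 + 1

print(picked_positions(6))  # (3, 4)
print(picked_positions(7))  # (4, 4)
```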
Korbonits