1

I would like to calculate the rolling median for a column in Greenplum, i.e. as below:

|  x | rolling_median_x |
| -- + ---------------- |
|  4 |                4 |
|  1 |              2.5 |
|  3 |                3 |
|  2 |              2.5 |
|  1 |                2 |
|  6 |              2.5 |
|  9 |                3 |

x is an integer and for each row rolling_median_x shows the median of x for the current and preceding rows. E.g. for the third row rolling_median_x = median(4, 1, 3) = 3.

Things I've found out so far:

  • the median function can't be used as a framed window function, i.e. median(x) OVER(RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
  • the same is true for many other function such as percent_rank or nth_value
  • recursive self join is not supported in this version of Greenplum

As a matter of fact I was unable to find proper documentation on which functions can be used as framed window function in Greenplum...

I'm using Greenplum 4.3.4.0 (which is based on Postgres 8.2.15) and updating is not an option unfortunately.

a_horse_with_no_name
  • 440,273
  • 77
  • 685
  • 758
PeterAL
  • 57
  • 6
  • 8.2.15? Are you sure it isn't 9.2.15? I'm not even sure 8.2 has window functions... – bobflux Aug 29 '17 at 16:13
  • 2
    Documentation for 8.2: https://www.postgresql.org/docs/8.2/static/index.html. Window functions were not introduced until 8.4. median() is not a postgresql function - perhaps a user defined aggregate specific to your installation? 8.2 has not been supported for years. https://stackoverflow.com/search?q=window+functions+%5Bmysql%5D shows examples of working around a lack of window functions. – gwaigh Aug 29 '17 at 16:37
  • You are out of luck with such an ancient Postgres version. – a_horse_with_no_name Aug 29 '17 at 16:45
  • "rolling_median_x shows the median of x for the current and preceding rows" => you mean all the rows from the beginning? or like the last 10 rows? With lots of rows, this could be very slow... – bobflux Aug 29 '17 at 16:47
  • @peufeu Yes, unfortunately I'm sure. It's a Greenplum db which is built on 8.2. It returns PostgreSQL 8.2.15 (Greenplum Database 4.3.4.0 build 1) [...] for the `SELECT version()` command. For the rolling median definition: yes, all the rows from the beginning (unbounded preceding). @gwaigh I'm not sure how window functions weren't introduced when I can use other functions. E.g. I can use `avg(x) OVER( PARTITION BY... ORDER BY... RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` @a_horse_with_no_name thanks for your corrections – PeterAL Aug 30 '17 at 07:35

2 Answers2

1

One remark - a citate from Wikipedia: ORDER BY

ORDER BY is the only way to sort the rows in the result set. Without this clause, the relational database system may return the rows in any order. If an ordering is required, the ORDER BY must be provided in the SELECT statement sent by the application. Although some database systems allow the specification of an ORDER BY clause in subselects or view definitions, the presence there has no effect. A view is a logical relational table, and the relational model mandates that a table is a set of rows, implying no sort order whatsoever.


Since you need to calculate the median for the current and preceding rows, you must have in the table an additional row that defines the order of rows and can be used to determine which rows precede given row and which ones come after.
Let say some id column like this:

| id | x | rolling_median_x |
|----|---|------------------|
|  1 | 4 |                4 |
|  2 | 1 |              2.5 |
|  3 | 3 |                3 |
|  4 | 2 |              2.5 |
|  5 | 1 |                2 |
|  6 | 6 |              2.5 |
|  7 | 9 |                3 |

If you cannot use analytic functions, then try pure SQL.
This article shows various methods of computing the Median with SQL.
I think the Henderson’s Median would be best for our needs:

SELECT CASE COUNT(*) % 2
       WHEN 0        -- even sized table
       THEN (P1.part_wgt + MIN(CASE WHEN P2.part_wgt > P1.part_wgt
                                  THEN P2.part_wgt
                                  ELSE NULL END))/2.0
       ELSE P1.part_wgt --odd sized table
       END AS median 
  FROM Parts AS P1, Parts AS P2
 GROUP BY P1.part_wgt
HAVING COUNT(CASE WHEN P1.part_wgt >= P2.part_wgt
                  THEN 1
                  ELSE NULL END)
       = (COUNT(*) + 1) / 2;

Just run this query for each row as a dependent subquery, a general idea is like this:

SELECT t.*, (
        SELECT .... Henderson's query FROM table x
        WHERE x.id <= t.id
        ......
       ) As our_median
FROM table t

You can find an example implementation in this demo

SELECT t.*, (
    SELECT CASE COUNT(*) % 2
           WHEN 0        -- even sized table
           THEN (P1.x + MIN(CASE WHEN P2.x > P1.x
                                      THEN P2.x
                                      ELSE NULL END))/2.0
           ELSE P1.x --odd sized table
           END AS median 
      FROM Table333 AS P1, Table333 AS P2
      WHERE p1.id <= t.id AND p2.id <= t.id
     GROUP BY P1.x
    HAVING COUNT(CASE WHEN P1.x >= P2.x
                      THEN 1
                      ELSE NULL END)
           = (COUNT(*) + 1) / 2
    ) as Our_median
FROM Table333 t;

| id | x | rolling_median_x | our_median |
|----|---|------------------|------------|
|  1 | 4 |                4 |          4 |
|  2 | 1 |              2.5 |        2.5 |
|  3 | 3 |                3 |          3 |
|  4 | 2 |              2.5 |        2.5 |
|  5 | 1 |                2 |          2 |
|  6 | 6 |              2.5 |        2.5 |
|  7 | 9 |                3 |          3 |

This query will probably be slow - this is a price you must pay for having ancient version of PostgreSQL

krokodilko
  • 32,441
  • 5
  • 40
  • 67
0

I'm using psql 8.2.15 and updating is not an option unfortunately.

Ouch.

If it was the rolling mean, things would be simple, but the rolling median will be very slow due to the need for sorting. The way to avoid this is to insert the values into a heap or a btree as they come which allows to get the rolling median without sorting on each new value. But this needs custom code.

I would use plpython to implement this:

Rolling median algorithm in C

bobflux
  • 9,971
  • 2
  • 24
  • 23