5

I am working with SQL Server 2008

If I have a Table as such:

Code   Value
-----------------------
4      240
4      299
4      210
2      NULL
2      3
6      30
6      80
6      10
4      240
2      30

How can I find the median AND group by the Code column please? To get a resultset like this:

Code   Median
-----------------------
4      240
2      16.5
6      30

I really like this solution for median, but unfortunately it doesn't include Group By: https://stackoverflow.com/a/2026609/106227

Community
  • 1
  • 1
Stu Harper
  • 641
  • 1
  • 9
  • 20
  • Possible duplicate of http://stackoverflow.com/questions/1342898/function-to-calculate-median-in-sql-server – jean Dec 13 '13 at 12:29
  • I don't think that link actually deals with grouping by a second column. – Stu Harper Dec 13 '13 at 12:33
  • In that link the OP states "What would be the best way (if possible) to do this - allow for the calculation of a median value (assuming a numeric data type) in an aggregate query?" It usually means he ill using a group by (AGGREGATE). – jean Dec 13 '13 at 12:37
  • @StuHarper: *"it doesn't include Group By"*...No it does!! – huMpty duMpty Dec 13 '13 at 12:54

2 Answers2

9

The solution using rank works nicely when you have an odd number of members in each group, i.e. the median exists within the sample, where you have an even number of members the rank method will fall down, e.g.

1
2
3
4

The median here is 2.5 (i.e. half the group is smaller, and half the group is larger) but the rank method will return 3. To get around this you essentially need to take the top value from the bottom half of the group, and the bottom value of the top half of the group, and take an average of the two values.

WITH CTE AS
(   SELECT  Code,
            Value, 
            [half1] = NTILE(2) OVER(PARTITION BY Code ORDER BY Value), 
            [half2] = NTILE(2) OVER(PARTITION BY Code ORDER BY Value DESC)
    FROM    T
    WHERE   Value IS NOT NULL
)
SELECT  Code,
        (MAX(CASE WHEN Half1 = 1 THEN Value END) + 
        MIN(CASE WHEN Half2 = 1 THEN Value END)) / 2.0
FROM    CTE
GROUP BY Code;

Example on SQL Fiddle


In SQL Server 2012 you can use PERCENTILE_CONT

SELECT  DISTINCT
        Code,
        Median = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Value) OVER(PARTITION BY Code)
FROM    T;

Example on SQL Fiddle

GarethD
  • 61,378
  • 9
  • 75
  • 114
2

SQL Server does not have a function to calculate medians, but you could use the ROW_NUMBER function like this:

WITH RankedTable AS (
    SELECT Code, Value, 
        ROW_NUMBER() OVER (PARTITION BY Code ORDER BY VALUE) AS Rnk,
        COUNT(*) OVER (PARTITION BY Code) AS Cnt
    FROM MyTable
)
SELECT Code, Value
FROM RankedTable
WHERE Rnk = Cnt / 2 + 1

To elaborate a bit on this solution, consider the output of the RankedTable CTE:

Code   Value   Rnk    Cnt
---------------------------
4      240     2      3   -- Median
4      299     3      3
4      210     1      3
2      NULL    1      2
2      3       2      2   -- Median
6      30      2      3   -- Median
6      80      3      3
6      10      1      3

Now from this result set, if you only return those rows where Rnk equals Cnt / 2 + 1 (integer division), you get only the rows with the median value for each group.

Dan
  • 9,855
  • 17
  • 45
  • Thanks Dan. This nearly works, but not quite. I added another row (4, 240). This results in your query missing out Code 4 completely from the resultset. – Stu Harper Dec 13 '13 at 12:53
  • Ah yeah, you'll get in trouble when you have multiple identical values, as these are assigned the same RANK. I'm editing my solution to use the ROW_NUMBER function instead, as this will assign a unique value to each row. – Dan Dec 13 '13 at 12:54
  • Again, thanks Dan but it still doesn't quite work. I added another row (2,30) and your query then results in 2,30 rather than 2,16.5. GarethD's solution below handles that OK. – Stu Harper Dec 13 '13 at 13:30
  • How can 16,5 be the median, if the number doesn't even exist in the original dataset? – Dan Dec 13 '13 at 13:33
  • 1
    @Dan The median doesn't have to be in the original dataset. According to [Wikipedia](http://en.wikipedia.org/wiki/Median): *"If there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values"* – GarethD Dec 13 '13 at 13:37
  • Thanks @GarethD. I should have paid more attention in statistics class :-) – Dan Dec 13 '13 at 13:41