241

According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (using CREATE AGGREGATE, a user-defined function, or some other method).

What would be the best way (if possible) to do this, allowing for the calculation of a median value (assuming a numeric data type) in an aggregate query?

Amira Bedhiafi
Yaakov Ellis

33 Answers

213

If you're using SQL 2005 or better, this is a nice, simple-ish median calculation for a single column in a table:

SELECT
(
 (SELECT MAX(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score) AS BottomHalf)
 +
 (SELECT MIN(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score DESC) AS TopHalf)
) / 2 AS Median
Jeff Atwood
  • That's clever, and relatively simple given that there exists no Median() aggregate function. But how is it that no Median() function exists!? I'm a bit FLOOR()ed, frankly. – Katie Kilian Jan 31 '12 at 19:42
  • Well, nice and simple, but usually you need the median per certain group category, i.e. like `select gid, median(score) from T group by gid`. Do you need a correlated subquery for that? – Tomas Jul 01 '13 at 12:01
  • ... I mean like in [this case](http://meta.stackexchange.com/a/186808/166308) (the 2nd query named "Users with highest median answer score"). – Tomas Jul 01 '13 at 13:02
  • Tomas - did you manage to solve your "per certain group category" issue please? As I have the same problem. Thanks. – Stu Harper Dec 13 '13 at 11:37
  • How to use this solution with a GROUP BY? – Przemyslaw Remin May 13 '15 at 09:32
  • Is this correct for an odd number of results? Like for 9 elements, the median is the fifth one. – Konstantin Chernov Feb 06 '18 at 01:46
  • You can also `UNION` the `MAX` and `MIN` in a CTE, and take the `AVG` of them. – brianary Oct 29 '18 at 22:16
  • If "Score" field is nullable, the median won't compute. You can specify in query: 'SELECT ( (SELECT MAX(Score) FROM (SELECT TOP 50 PERCENT Score FROM Posts Where Score is not null ORDER BY Score) AS BottomHalf) + (SELECT MIN(Score) FROM (SELECT TOP 50 PERCENT Score FROM Posts Where Score is not null ORDER BY Score DESC) AS TopHalf) ) / 2 AS Median' – Hugues Paquet Blanchette Jul 04 '19 at 14:47
162

2019 UPDATE: In the 10 years since I wrote this answer, more solutions have been uncovered that may yield better results. Also, SQL Server releases since then (especially SQL 2012) have introduced new T-SQL features that can be used to calculate medians. SQL Server releases have also improved the query optimizer, which may affect the performance of the various median solutions. Net-net, my original 2009 post is still OK, but there may be better solutions for modern SQL Server apps. Take a look at this article from 2012, which is a great resource: https://sqlperformance.com/2012/08/t-sql-queries/median

This article found the following pattern to be much, much faster than all other alternatives, at least on the simple schema they tested. This solution was 373x faster (!!!) than the slowest (PERCENTILE_CONT) solution tested. Note that this trick requires two separate queries which may not be practical in all cases. It also requires SQL 2012 or later.

DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);

SELECT AVG(1.0 * val)
FROM (
    SELECT val FROM dbo.EvenRows
     ORDER BY val
     OFFSET (@c - 1) / 2 ROWS
     FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;

Of course, just because one test on one schema in 2012 yielded great results doesn't mean it will for you; your mileage may vary, especially if you're on SQL Server 2014 or later. If performance is important for your median calculation, I'd strongly suggest trying and perf-testing several of the options recommended in that article to make sure that you've found the best one for your schema.

I'd also be especially careful using the (new in SQL Server 2012) function PERCENTILE_CONT that's recommended in one of the other answers to this question, because the article linked above found this built-in function to be 373x slower than the fastest solution. It's possible that this disparity has been improved in the 7 years since, but personally I wouldn't use this function on a large table until I verified its performance vs. other solutions.

ORIGINAL 2009 POST IS BELOW:

There are lots of ways to do this, with dramatically varying performance. Here's one particularly well-optimized solution, from Medians, ROW_NUMBERs, and performance. It is close to optimal when it comes to the actual I/Os generated during execution: it looks more costly than other solutions, but it is actually much faster.

That page also contains a discussion of other solutions and performance testing details. Note the use of a unique column as a disambiguator in case there are multiple rows with the same value of the median column.

As with all database performance scenarios, always try to test a solution out with real data on real hardware – you never know when a change to SQL Server's optimizer or a peculiarity in your environment will make a normally-speedy solution slower.

SELECT
   CustomerId,
   AVG(TotalDue)
FROM
(
   SELECT
      CustomerId,
      TotalDue,
      -- SalesOrderId in the ORDER BY is a disambiguator to break ties
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
   FROM Sales.SalesOrderHeader SOH
) x
WHERE
   RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;
Justin Grant
  • I don't think this works if you have dupes, particularly a lot of dupes, in your data. You can't guarantee the row_numbers will line up. You can get some really crazy answers for your median, or even worse, no median at all. – Jonathan Beerhalter Oct 05 '10 at 23:58
  • That's why having a disambiguator (SalesOrderId in the code example above) is important, so you can ensure that the order of result-set rows is consistent both backwards and forwards. Often a unique primary key makes an ideal disambiguator because it's available without a separate index lookup. If there's no disambiguation column available (for example, if the table has no uniquifying key), then another approach must be used to calculate median, because as you correctly point out, if you can't guarantee that DESC row numbers are mirror images of ASC row numbers, then results are unpredictable. – Justin Grant Oct 06 '10 at 18:24
  • Thanks, when switching the columns to my DB, I dropped the disambiguator, thinking it wasn't relevant. In that case, this solution works really really well. – Jonathan Beerhalter Oct 10 '10 at 03:10
  • I suggest adding a comment to the code itself, describing the need for the disambiguator. – hoffmanc May 23 '12 at 14:27
  • Awesome! Long have I known its importance, but now I can give it a name... the disambiguator! Thank you Justin! – CodeMonkey Dec 19 '13 at 15:56
  • Without a disambiguator, would it be possible to get a count of each value (count+groupby), then check that `rowAsc` lies in the range `[rowDesc - valueCount / 2, rowDesc + valueCount / 2]`? – BallpointBen Aug 02 '18 at 18:16
  • Got stuck with this. Perfect answer (the original version) but I had a lot of same numbers and no SalesId, or any other unique ID. I found adding this to the original dataset (if you imagine Sales.SalesOrderHeader being a view or a WITH query and not a table) solves the issue (as we're just adding a fake unique ID): `NEWID() AS disambiguator`. – kiradotee Dec 16 '19 at 15:10
  • The "2019 UPDATE" is misleading, because the query and linked resource do not cover grouped medians, while the original answer does. You can't compare these queries directly. – Rudey Jan 18 '21 at 14:06
88

In SQL Server 2012 you should use PERCENTILE_CONT:

SELECT SalesOrderID, OrderQty,
    PERCENTILE_CONT(0.5) 
        WITHIN GROUP (ORDER BY OrderQty)
        OVER (PARTITION BY SalesOrderID) AS MedianCont
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
ORDER BY SalesOrderID DESC

See also: http://blog.sqlauthority.com/2011/11/20/sql-server-introduction-to-percentile_cont-analytic-functions-introduced-in-sql-server-2012/

David Fullerton
Simon_Weaver
  • This expert analysis makes a compelling argument against the PERCENTILE functions due to poor performance. http://sqlperformance.com/2012/08/t-sql-queries/median – carl.anderson Jan 20 '15 at 18:38
  • Don't you need to add a `DISTINCT` or `GROUP BY SalesOrderID`? Otherwise you'll have a lot of duplicate rows. – Konstantin Feb 23 '18 at 15:02
  • this is the answer. don't know why i had to scroll this far – FistOfFury Apr 05 '19 at 16:14
  • There is also a discrete version using `PERCENTILE_DISC` – johnDanger Nov 20 '19 at 18:50
  • emphasizing @carl.anderson's point above: a PERCENTILE_CONT solution was measured to be 373x slower (!!!!) compared to the fastest solution they tested on SQL Server 2012 on their particular test schema. Read the article that carl linked for more details. – Justin Grant Nov 20 '19 at 23:00
  • The top 50% way is beautiful/clever, but this method works nicely when there are multiple groupings – user9930055 Feb 18 '20 at 20:33
21

My original quick answer was:

select  max(my_column) as [my_column], quartile
from    (select my_column, ntile(4) over (order by my_column) as [quartile]
         from   my_table) i
--where quartile = 2
group by quartile

This will give you the median and interquartile range in one fell swoop. If you really only want one row that is the median then uncomment the where clause.

When you stick that into an explain plan, 60% of the work is sorting the data, which is unavoidable when calculating position-dependent statistics like this.

I've amended the answer to follow the excellent suggestion from Robert Ševčík-Robajz in the comments below:

;with PartitionedData as
  (select my_column, ntile(10) over (order by my_column) as [percentile]
   from   my_table),
MinimaAndMaxima as
  (select  min(my_column) as [low], max(my_column) as [high], percentile
   from    PartitionedData
   group by percentile)
select
  case
    when b.percentile = 10 then cast(b.high as decimal(18,2))
    else cast((a.low + b.high)  as decimal(18,2)) / 2
  end as [value], --b.high, a.low,
  b.percentile
from    MinimaAndMaxima a
  join  MinimaAndMaxima b on (a.percentile -1 = b.percentile) or (a.percentile = 10 and b.percentile = 10)
--where b.percentile = 5

This should calculate the correct median and percentile values when you have an even number of data items. Again, uncomment the final where clause if you only want the median and not the entire percentile distribution.

Sir Wobin
  • This actually works pretty well, and allows for partitioning of the data. – Jonathan Beerhalter Oct 06 '10 at 00:26
  • If it's OK to be off by one, then the query above is fine. But if you need the exact median, then you will have trouble. For example, for the sequence (1,3,5,7) the median is 4 but the query above returns 3. For (1,2,3,503,603,703) the median is 253 but the query above returns 503. – Justin Grant Oct 30 '12 at 06:12
  • You could fix the flaw of imprecision by taking max and min of each quartile in a subquery, then AVGing the MAX of the previous and MIN of the next? – Rbjz Jul 23 '13 at 14:31
18

Even better:

SELECT @Median = AVG(1.0 * val)
FROM
(
    SELECT o.val, rn = ROW_NUMBER() OVER (ORDER BY o.val), c.c
    FROM dbo.EvenRows AS o
    CROSS JOIN (SELECT c = COUNT(*) FROM dbo.EvenRows) AS c
) AS x
WHERE rn IN ((c + 1)/2, (c + 2)/2);

From the master Himself, Itzik Ben-Gan!

gdoron is supporting Monica
Alex Gordon
9

MS SQL Server 2012 (and later) has the PERCENTILE_DISC function which computes a specific percentile for sorted values. PERCENTILE_DISC (0.5) will compute the median - https://msdn.microsoft.com/en-us/library/hh231327.aspx
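
For example, a minimal sketch assuming a Posts table with a Score column (names borrowed from other answers in this thread); DISTINCT collapses the per-row repetition of the windowed result:

SELECT DISTINCT
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Score) OVER () AS MedianScore
FROM Posts;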

enkryptor
4

Simple, fast, and accurate (note that for an even number of rows it returns the lower of the two middle values rather than their average):

SELECT x.Amount
FROM   (SELECT Amount,
               COUNT(1)     OVER ()                    AS TotalRows,
               ROW_NUMBER() OVER (ORDER BY Amount ASC) AS AmountOrder
        FROM   facttransaction ft) x
WHERE  x.AmountOrder = ROUND(x.TotalRows / 2.0, 0)
DaveShaw
Tobbi
4

If you want to use the CREATE AGGREGATE function in SQL Server, this is how to do it. Doing it this way has the benefit of being able to write clean queries. Note that this process could be adapted to calculate a percentile value fairly easily.

Create a new Visual Studio project and set the target framework to .NET 3.5 (this is for SQL 2008; it may be different for SQL 2012). Then create a class file and put in the following code, or its C# equivalent:

Imports Microsoft.SqlServer.Server
Imports System.Collections.Generic
Imports System.Data.SqlTypes
Imports System.IO
Imports System.Linq

<Serializable>
<SqlUserDefinedAggregate(Format.UserDefined, IsInvariantToNulls:=True, IsInvariantToDuplicates:=False, _
  IsInvariantToOrder:=True, MaxByteSize:=-1, IsNullIfEmpty:=True)>
Public Class Median
  Implements IBinarySerialize
  Private _items As List(Of Decimal)

  Public Sub Init()
    _items = New List(Of Decimal)()
  End Sub

  Public Sub Accumulate(value As SqlDecimal)
    If Not value.IsNull Then
      _items.Add(value.Value)
    End If
  End Sub

  Public Sub Merge(other As Median)
    If other._items IsNot Nothing Then
      _items.AddRange(other._items)
    End If
  End Sub

  Public Function Terminate() As SqlDecimal
    If _items.Count <> 0 Then
      Dim result As Decimal
      _items = _items.OrderBy(Function(i) i).ToList()
      If _items.Count Mod 2 = 0 Then
        result = ((_items((_items.Count / 2) - 1)) + (_items(_items.Count / 2))) / 2@
      Else
        result = _items((_items.Count - 1) / 2)
      End If

      Return New SqlDecimal(result)
    Else
      Return New SqlDecimal()
    End If
  End Function

  Public Sub Read(r As BinaryReader) Implements IBinarySerialize.Read
    'deserialize it from a string
    Dim list = r.ReadString()
    _items = New List(Of Decimal)

    For Each value In list.Split(","c)
      Dim number As Decimal
      If Decimal.TryParse(value, number) Then
        _items.Add(number)
      End If
    Next

  End Sub

  Public Sub Write(w As BinaryWriter) Implements IBinarySerialize.Write
    'serialize the list to a string
    Dim list = ""

    For Each item In _items
      If list <> "" Then
        list += ","
      End If      
      list += item.ToString()
    Next
    w.Write(list)
  End Sub
End Class

Then compile it and copy the DLL and PDB file to your SQL Server machine and run the following command in SQL Server:

CREATE ASSEMBLY CustomAggregate FROM '{path to your DLL}'
WITH PERMISSION_SET=SAFE;
GO

CREATE AGGREGATE Median(@value decimal(9, 3))
RETURNS decimal(9, 3) 
EXTERNAL NAME [CustomAggregate].[{namespace of your DLL}.Median];
GO

You can then write a query to calculate the median like this:

SELECT dbo.Median(Field) FROM Table

Rono
3

The following query returns the median from a list of values in one column. It cannot be used as or along with an aggregate function, but you can still use it as a sub-query with a WHERE clause in the inner select.

SQL Server 2005+:

SELECT TOP 1 value from
(
    SELECT TOP 50 PERCENT value 
    FROM table_name 
    ORDER BY  value
)for_median
ORDER BY value DESC
PyQL
3

Although Justin Grant's solution appears solid, I found that when you have a number of duplicate values within a given partition key, the row numbers for the ASC duplicate values end up out of sequence, so they do not properly align.

Here is a fragment from my result:

KEY VALUE ROWA ROWD  

13  2     22   182
13  1     6    183
13  1     7    184
13  1     8    185
13  1     9    186
13  1     10   187
13  1     11   188
13  1     12   189
13  0     1    190
13  0     2    191
13  0     3    192
13  0     4    193
13  0     5    194

I used Justin's code as the basis for this solution. Although not as efficient given the use of multiple derived tables, it does resolve the row-ordering problem I encountered. Any improvements would be welcome, as I am not that experienced in T-SQL.

SELECT PKEY, cast(AVG(VALUE)as decimal(5,2)) as MEDIANVALUE
FROM
(
  SELECT PKEY,VALUE,ROWA,ROWD,
  'FLAG' = (CASE WHEN ROWA IN (ROWD,ROWD-1,ROWD+1) THEN 1 ELSE 0 END)
  FROM
  (
    SELECT
    PKEY,
    cast(VALUE as decimal(5,2)) as VALUE,
    ROWA,
    ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY ROWA DESC) as ROWD 

    FROM
    (
      SELECT
      PKEY, 
      VALUE,
      ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE ASC,PKEY ASC ) as ROWA 
      FROM [MTEST]
    )T1
  )T2
)T3
WHERE FLAG = '1'
GROUP BY PKEY
ORDER BY PKEY
C B
3

I just came across this page while looking for a set-based solution to median. After looking at some of the solutions here, I came up with the following. Hope it helps/works.

DECLARE @test TABLE(
    i int identity(1,1),
    id int,
    score float
)

INSERT INTO @test (id,score) VALUES (1,10)
INSERT INTO @test (id,score) VALUES (1,11)
INSERT INTO @test (id,score) VALUES (1,15)
INSERT INTO @test (id,score) VALUES (1,19)
INSERT INTO @test (id,score) VALUES (1,20)

INSERT INTO @test (id,score) VALUES (2,20)
INSERT INTO @test (id,score) VALUES (2,21)
INSERT INTO @test (id,score) VALUES (2,25)
INSERT INTO @test (id,score) VALUES (2,29)
INSERT INTO @test (id,score) VALUES (2,30)

INSERT INTO @test (id,score) VALUES (3,20)
INSERT INTO @test (id,score) VALUES (3,21)
INSERT INTO @test (id,score) VALUES (3,25)
INSERT INTO @test (id,score) VALUES (3,29)

DECLARE @counts TABLE(
    id int,
    cnt int
)

INSERT INTO @counts (
    id,
    cnt
)
SELECT
    id,
    COUNT(*)
FROM
    @test
GROUP BY
    id

SELECT
    drv.id,
    drv.start,
    AVG(t.score)
FROM
    (
        SELECT
            MIN(t.i)-1 AS start,
            t.id
        FROM
            @test t
        GROUP BY
            t.id
    ) drv
    INNER JOIN @test t ON drv.id = t.id
    INNER JOIN @counts c ON t.id = c.id
WHERE
    t.i = ((c.cnt+1)/2)+drv.start
    OR (
        t.i = (((c.cnt+1)%2) * ((c.cnt+2)/2))+drv.start
        AND ((c.cnt+1)%2) * ((c.cnt+2)/2) <> 0
    )
GROUP BY
    drv.id,
    drv.start
forsvarir
brian
2

In a UDF, write:

 Select Top 1 medianSortColumn from Table T
  Where (Select Count(*) from Table
         Where MedianSortColumn < T.medianSortColumn) =
        (Select Count(*) From Table) / 2
  Order By medianSortColumn
Peter Mortensen
Charles Bretana
2

Justin's example above is very good, but the need for a primary key should be stated very clearly. I have seen that code in the wild without the key, and the results are bad.

The complaint I get about PERCENTILE_CONT is that it won't give you an actual value from the dataset. To get a "median" that is an actual value from the dataset, use PERCENTILE_DISC.

SELECT SalesOrderID, OrderQty,
    PERCENTILE_DISC(0.5) 
        WITHIN GROUP (ORDER BY OrderQty)
        OVER (PARTITION BY SalesOrderID) AS MedianDisc
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
ORDER BY SalesOrderID DESC
2

Median Finding

This is a simple method to find the median of an attribute.

Select round(S.salary,4) median from employee S 
where (select count(salary) from employee 
where salary < S.salary ) = (select count(salary) from employee
where salary > S.salary)
Nivesh Krishna
1

See other solutions for median calculation in SQL here: "Simple way to calculate median with MySQL" (the solutions are mostly vendor-independent).

Bill Karwin
1

For a continuous variable/measure 'col1' from 'table1':

select col1  
from
    (select top 50 percent col1, 
    ROW_NUMBER() OVER(ORDER BY col1 ASC) AS Rowa,
    ROW_NUMBER() OVER(ORDER BY col1 DESC) AS Rowd
    from table1 ) tmp
where tmp.Rowa = tmp.Rowd
1

Frequently, we may need to calculate the median not just for the whole table, but for aggregates with respect to some ID. In other words, calculate the median for each ID in our table, where each ID has many records. (Based on the solution edited by @gdoron: good performance, and it works in many SQL dialects.)

SELECT our_id, AVG(1.0 * our_val) as Median
FROM
( SELECT our_id, our_val, 
  COUNT(*) OVER (PARTITION BY our_id) AS cnt,
  ROW_NUMBER() OVER (PARTITION BY our_id ORDER BY our_val) AS rnk
  FROM our_table
) AS x
WHERE rnk IN ((cnt + 1)/2, (cnt + 2)/2)
GROUP BY our_id;

Hope it helps.

1

Using a single statement: one way is to use the ROW_NUMBER() and COUNT() window functions and filter the sub-query. Here is how to find the median salary:

 SELECT AVG(e_salary) 
 FROM                                                             
    (SELECT 
      ROW_NUMBER() OVER(ORDER BY e_salary) as row_no, 
      e_salary,
      (COUNT(*) OVER()+1)*0.5 AS row_half
     FROM Employee) t
 WHERE row_no IN (FLOOR(row_half),CEILING(row_half))

I have seen similar solutions over the net using FLOOR and CEILING, but I tried to use a single statement.

0

I wanted to work out a solution by myself, but my brain tripped and fell on the way. I think it works, but don't ask me to explain it in the morning. :P

DECLARE @table AS TABLE
(
    Number int not null
);

insert into @table select 2;
insert into @table select 4;
insert into @table select 9;
insert into @table select 15;
insert into @table select 22;
insert into @table select 26;
insert into @table select 37;
insert into @table select 49;

DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;

WITH MyResults(RowNo, Number) AS
(
    SELECT RowNo, Number FROM
        (SELECT ROW_NUMBER() OVER (ORDER BY Number) AS RowNo, Number FROM @table) AS Foo
)
SELECT AVG(1.0 * Number) -- 1.0 * avoids integer truncation when averaging the two middle rows
FROM MyResults
WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2)
Gavin
0
--Create Temp Table to Store Results in
DECLARE @results AS TABLE 
(
    [Month] datetime not null
 ,[Median] int not null
);

--This variable will determine the date
DECLARE @IntDate as int 
set @IntDate = -13


WHILE (@IntDate < 0) 
BEGIN

--Create Temp Table
DECLARE @table AS TABLE 
(
    [Rank] int not null
 ,[Days Open] int not null
);

--Insert records into Temp Table
insert into @table 

SELECT 
    rank() OVER (ORDER BY DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0), DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')),[SVR].[ref_num]) as [Rank]
 ,DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')) as [Days Open]
FROM
 mdbrpt.dbo.View_Request SVR
 LEFT OUTER JOIN dbo.dtv_apps_systems vapp 
 on SVR.category = vapp.persid
 LEFT OUTER JOIN dbo.prob_ctg pctg 
 on SVR.category = pctg.persid
 Left Outer Join [mdbrpt].[dbo].[rootcause] as [Root Cause] 
 on [SVR].[rootcause]=[Root Cause].[id]
 Left Outer Join [mdbrpt].[dbo].[cr_stat] as [Status]
 on [SVR].[status]=[Status].[code]
 LEFT OUTER JOIN [mdbrpt].[dbo].[net_res] as [net] 
 on [net].[id]=SVR.[affected_rc]
WHERE
 SVR.Type IN ('P') 
 AND
 SVR.close_date IS NOT NULL 
 AND
 [Status].[SYM] = 'Closed'
 AND
 SVR.parent is null
 AND
 [Root Cause].[sym] in ( 'RC - Application','RC - Hardware', 'RC - Operational', 'RC - Unknown')
 AND
 (
  [vapp].[appl_name] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS')
 OR
  pctg.sym in ('Systems.Release Health Dashboard.Problem','DTV QA Test.Enterprise Release.Deferred Defect Log')
 AND  
  [Net].[nr_desc] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS')
 )
 AND
 DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0) = DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0)
ORDER BY [Days Open]



DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;

WITH MyResults(RowNo, [Days Open]) AS
(
    SELECT RowNo, [Days Open] FROM
        (SELECT ROW_NUMBER() OVER (ORDER BY [Days Open]) AS RowNo, [Days Open] FROM @table) AS Foo
)


insert into @results
SELECT 
 DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0) as [Month]
 ,AVG([Days Open])as [Median] FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2) 


set @IntDate = @IntDate+1
DELETE FROM @table
END

select *
from @results
order by [Month]
Jeff Atwood
0

For newbies like myself who are learning the very basics, I personally find this example easier to follow, as it is easier to understand exactly what's happening and where the median values are coming from...

select
 ( max(a.[Value1]) + min(a.[Value1]) ) / 2 as [Median Value1]
,( max(a.[Value2]) + min(a.[Value2]) ) / 2 as [Median Value2]

from (select
    datediff(dd,startdate,enddate) as [Value1]
    ,xxxxxxxxxxxxxx as [Value2]
     from dbo.table1
     )a

In absolute awe of some of the code above though!!!

0

This is as simple an answer as I could come up with. Worked well with my data. If you want to exclude certain values just add a where clause to the inner select.

SELECT TOP 1 
    ValueField AS MedianValue
FROM
    (SELECT TOP(SELECT COUNT(1)/2 FROM tTABLE)
        ValueField
    FROM 
        tTABLE
    ORDER BY 
        ValueField) A
ORDER BY
    ValueField DESC
0

The following solution works under these assumptions:

  • No duplicate values
  • No NULLs

Code:

IF OBJECT_ID('dbo.R', 'U') IS NOT NULL
  DROP TABLE dbo.R

CREATE TABLE R (
    A FLOAT NOT NULL);

INSERT INTO R VALUES (1);
INSERT INTO R VALUES (2);
INSERT INTO R VALUES (3);
INSERT INTO R VALUES (4);
INSERT INTO R VALUES (5);
INSERT INTO R VALUES (6);

-- Returns Median(R)
select SUM(A) / CAST(COUNT(A) AS FLOAT)
from R R1 
where ((select count(A) from R R2 where R1.A > R2.A) = 
      (select count(A) from R R2 where R1.A < R2.A)) OR
      ((select count(A) from R R2 where R1.A > R2.A) + 1 = 
      (select count(A) from R R2 where R1.A < R2.A)) OR
      ((select count(A) from R R2 where R1.A > R2.A) = 
      (select count(A) from R R2 where R1.A < R2.A) + 1) ; 
Maria Ines Parnisari
0
DECLARE @Obs int

DECLARE @RowAsc table
(
    ID          INT IDENTITY,
    Observation FLOAT
)

INSERT INTO @RowAsc
SELECT Observations
FROM MyTable
ORDER BY 1

-- (COUNT(*) + 1) / 2 lands on the middle row for odd counts
-- (and the lower-middle row for even counts)
SELECT @Obs = (COUNT(*) + 1) / 2 FROM @RowAsc

SELECT Observation AS Median FROM @RowAsc WHERE ID = @Obs
0

For large-scale datasets, you can try this GIST:

https://gist.github.com/chrisknoll/1b38761ce8c5016ec5b2

It works by aggregating the distinct values you would find in your set (such as ages, or year of birth, etc.), and uses SQL window functions to locate any percentile position you specify in the query.
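
A minimal sketch of that idea (not the GIST itself; the Ages table and Age column are assumed for illustration): aggregate the distinct values with their counts, then walk a running total up to the 50% position.

WITH Counts AS (
    SELECT Age, COUNT(*) AS cnt
    FROM Ages
    GROUP BY Age
),
Running AS (
    SELECT Age, cnt,
           SUM(cnt) OVER (ORDER BY Age) AS cume,  -- running total (SQL 2012+)
           SUM(cnt) OVER ()             AS total
    FROM Counts
)
SELECT MIN(Age) AS Median  -- first value whose running total reaches the midpoint
FROM Running
WHERE cume >= (total + 1) / 2.0;

This returns the discrete median (the lower of the two middle values for even counts), in the spirit of PERCENTILE_DISC.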

Peter Mortensen
Chris Knoll
0

I tried several alternatives, but because my data records have repeated values, the ROW_NUMBER versions were not an option for me. So here is the query I used (a version with NTILE):

SELECT distinct
   CustomerId,
   (
       MAX(CASE WHEN Percent50_Asc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId)  +
       MIN(CASE WHEN Percent50_desc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId) 
   )/2 MEDIAN
FROM
(
   SELECT
      CustomerId,
      TotalDue,
     NTILE(2) OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue ASC) AS Percent50_Asc,
     NTILE(2) OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue DESC) AS Percent50_desc
   FROM Sales.SalesOrderHeader SOH
) x
ORDER BY CustomerId;
Galled
0

Building on Jeff Atwood's answer above, here it is with GROUP BY and a correlated subquery to get the median for each group.

SELECT TestID, 
(
 (SELECT MAX(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score) AS BottomHalf)
 +
 (SELECT MIN(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score DESC) AS TopHalf)
) / 2 AS MedianScore,
AVG(Score) AS AvgScore, MIN(Score) AS MinScore, MAX(Score) AS MaxScore
FROM Posts_parent
GROUP BY Posts_parent.TestID
Jim B
0

For your question, Jeff Atwood has already given a simple and effective solution. But if you are looking for an alternative approach to calculate the median, the SQL code below will help you.

create table employees(salary int);

insert into employees values(8);
insert into employees values(23);
insert into employees values(45);
insert into employees values(123);
insert into employees values(93);
insert into employees values(2342);
insert into employees values(2238);

select * from employees;

declare @odd_even int;
declare @cnt int;
declare @middle_no int;

set @cnt = (select count(*) from employees);
set @middle_no = (@cnt / 2) + 1;
select @odd_even = case when (@cnt % 2 = 0) then -1 else 0 end;

select avg(tbl.salary)
from (select salary, row_number() over (order by salary) as rno
      from employees
      group by salary) tbl
where tbl.rno = @middle_no or tbl.rno = @middle_no + @odd_even;

If you are looking to calculate median in MySQL, this github link will be useful.

0

This is the most optimal solution for finding medians that I can think of. The names in the example are based on Justin's example. Make sure an index exists on Sales.SalesOrderHeader with key columns CustomerId and TotalDue, in that order.

SELECT
 sohCount.CustomerId,
 AVG(sohMid.TotalDue) as TotalDueMedian
FROM 
(SELECT 
  soh.CustomerId,
  COUNT(*) as NumberOfRows
FROM 
  Sales.SalesOrderHeader soh 
GROUP BY soh.CustomerId) As sohCount
CROSS APPLY 
    (Select 
       soh.TotalDue
    FROM 
    Sales.SalesOrderHeader soh 
    WHERE soh.CustomerId = sohCount.CustomerId 
    ORDER BY soh.TotalDue
    OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS 
    FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
    ) As sohMid
GROUP BY sohCount.CustomerId

UPDATE

I was a bit unsure about which method has the best performance, so I compared my method with Justin Grant's and Jeff Atwood's by running a query based on all three methods in one batch. The batch cost of each query was:

Without index:

  • Mine 30%
  • Justin Grant's 13%
  • Jeff Atwood's 58%

And with an index:

  • Mine 3%
  • Justin Grant's 10%
  • Jeff Atwood's 87%

I tried to see how well the queries scale when you have an index by creating more data, from around 14,000 rows by a factor of 2 up to 512, which means in the end around 7.2 million rows. Note I made sure the CustomerId field was unique for each copy I made, so the proportion of rows to unique instances of CustomerId was kept constant. While I was doing this, I ran executions where I rebuilt the index afterwards, and I noticed the results stabilized at around a factor of 128 with the data I had, at these values:

  • Mine 3%
  • Justin Grant's 5%
  • Jeff Atwood's 92%

I wondered how performance would be affected by scaling the number of rows while keeping the number of unique CustomerId values constant, so I set up a new test where I did just this. Now, instead of stabilizing, the batch cost ratio kept diverging; also, instead of about 20 rows per CustomerId on average, I ended up with around 10,000 rows per unique Id. The numbers were:

  • Mine 4%
  • Justin Grant's 60%
  • Jeff Atwood's 35%

I made sure I implemented each method correctly by comparing the results. My conclusion is that the method I used is generally faster as long as an index exists. I also noticed that this method is what's recommended for this particular problem in this article: https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5

A way to further improve the performance of subsequent calls to this query is to persist the count information in an auxiliary table. You could even maintain it with a trigger that updates and holds the count of SalesOrderHeader rows per CustomerId; of course, you could then simply store the median as well. A sketch of such a trigger follows.
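
A hypothetical sketch of such a trigger, with assumed names for the auxiliary table; it applies the net per-customer row-count change of each INSERT/DELETE statement:

CREATE TABLE dbo.CustomerOrderCounts (
    CustomerId   INT PRIMARY KEY,
    NumberOfRows INT NOT NULL
);
GO

CREATE TRIGGER trg_SalesOrderHeader_Counts
ON Sales.SalesOrderHeader
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Net change per CustomerId: +1 for each inserted row, -1 for each deleted row
    MERGE dbo.CustomerOrderCounts AS tgt
    USING (
        SELECT CustomerId, SUM(delta) AS delta
        FROM (SELECT CustomerId, 1 AS delta FROM inserted
              UNION ALL
              SELECT CustomerId, -1 FROM deleted) AS d
        GROUP BY CustomerId
    ) AS src
    ON tgt.CustomerId = src.CustomerId
    WHEN MATCHED THEN
        UPDATE SET NumberOfRows = tgt.NumberOfRows + src.delta
    WHEN NOT MATCHED THEN
        INSERT (CustomerId, NumberOfRows) VALUES (src.CustomerId, src.delta);
END;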

Kaveh Hadjari
0

Below is my solution:

with tempa as
 (
   select value,
          row_number() over (order by value) as Rn, /* Assigning a row_number */
          count(value) over () as Cnt               /* Taking total count of the values */
   from numbers
   where value is not null /* Excluding the null values */
 ),
tempb as
 (
   /* Since we don't know whether the number of rows is odd or even, we shall
      consider both the scenarios */
   select round(cnt/2) as Ref from tempa where mod(cnt,2) = 1
   union all
   select round(cnt/2) as Ref from tempa where mod(cnt,2) = 0
   union all
   select round(cnt/2) + 1 as Ref from tempa where mod(cnt,2) = 0
 )
select avg(value) as Median_Value
from tempa
where rn in (select Ref from tempb);

mybrave
0

Using the COUNT aggregate, you can first count how many rows there are and store that in a variable called @cnt. Then you can compute parameters for the OFFSET-FETCH filter to specify, based on qty ordering, how many rows to skip (offset value) and how many to filter (fetch value).

The number of rows to skip is (@cnt - 1) / 2. It’s clear that for an odd count this calculation is correct because you first subtract 1 for the single middle value, before you divide by 2.

This also works correctly for an even count because the division used in the expression is integer division; so, when subtracting 1 from an even count, you’re left with an odd value.

When dividing that odd value by 2, the fraction part of the result (.5) is truncated. The number of rows to fetch is 2 - (@cnt % 2). The idea is that when the count is odd, the result of the modulo operation is 1, and you need to fetch 1 row. When the count is even, the result is 0, and you need to fetch 2 rows. By subtracting the 1 or 0 result of the modulo operation from 2, you get the desired 1 or 2, respectively. Finally, to compute the median quantity, take the one or two result quantities and apply an average, after converting the input integer value to a numeric one, as follows:

DECLARE @cnt AS INT = (SELECT COUNT(*) FROM [Sales].[production].[stocks]);

SELECT AVG(1.0 * quantity) AS median
FROM ( SELECT quantity
       FROM [Sales].[production].[stocks]
       ORDER BY quantity
       OFFSET (@cnt - 1) / 2 ROWS FETCH NEXT 2 - @cnt % 2 ROWS ONLY ) AS D;
Amira Bedhiafi
0

This works with SQL 2000:

DECLARE @testTable TABLE 
( 
    VALUE   INT
)
--INSERT INTO @testTable -- Even Test
--SELECT 3 UNION ALL
--SELECT 5 UNION ALL
--SELECT 7 UNION ALL
--SELECT 12 UNION ALL
--SELECT 13 UNION ALL
--SELECT 14 UNION ALL
--SELECT 21 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 29 UNION ALL
--SELECT 40 UNION ALL
--SELECT 56

--
--INSERT INTO @testTable -- Odd Test
--SELECT 3 UNION ALL
--SELECT 5 UNION ALL
--SELECT 7 UNION ALL
--SELECT 12 UNION ALL
--SELECT 13 UNION ALL
--SELECT 14 UNION ALL
--SELECT 21 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 29 UNION ALL
--SELECT 39 UNION ALL
--SELECT 40 UNION ALL
--SELECT 56


DECLARE @RowAsc TABLE
(
    ID      INT IDENTITY,
    Amount  INT
)

INSERT INTO @RowAsc
SELECT  VALUE 
FROM    @testTable 
ORDER BY VALUE ASC

SELECT  AVG(amount)
FROM @RowAsc ra
WHERE ra.id IN
(
    SELECT  ID 
    FROM    @RowAsc
    WHERE   ra.id -
    (
        SELECT  MAX(id) / 2.0 
        FROM    @RowAsc
    ) BETWEEN 0 AND 1

)
SQLMason
-2

Try the logic below to find the median.

Consider a table with the following numbers: 1, 1, 2, 3, 4, 5

The median is 2.5.

with tempa as (
  select num,
         count(num) over () as Cnt,
         row_number() over (order by num) as Rnum
  from temp
),
tempb as (
  select round(cnt/2) as ref_value from tempa where mod(cnt,2) <> 0
  union all
  select round(cnt/2) from tempa where mod(cnt,2) = 0
  union all
  select round(cnt/2 + 1) from tempa where mod(cnt,2) = 0
)
select avg(num) from tempa where rnum in (select * from tempb);