2

I need to calculate the median number of days between the oppo_opened date and the oppo_offered date on a table called Opportunity, grouping results by a joined table Company. Doing the average is very easy:

----Average Opened to Offered last 12 months
Select Comp_Name,AVG(DATEDIFF(day,Oppo_Opened,oppo_offerDate)) as averageDTO from company join Opportunity on oppo_lender = Comp_CompanyId
where DATEADD(Year,-1,GETDATE()) > Oppo_Opened AND oppo_offerDate is not null
group by Comp_Name order by averageDTO desc

The above returns one row per company, as we are grouping by the Comp_Name.

It's not quite so easy when doing this for a median. I located the following post:

Function to Calculate Median in Sql Server

And that got me to here:

SELECT     Comp_Name,
    PERCENTILE_CONT(0.5) 
        WITHIN GROUP (ORDER BY DATEDIFF(day,Oppo_Opened,oppo_offerDate))
        OVER (PARTITION BY Comp_Name) AS MedianCont
from company inner join Opportunity on oppo_lender = Comp_CompanyId
where DATEADD(Year,-1,GETDATE()) > Oppo_Opened AND oppo_offerDate is not null
order by MedianCont desc

However, this returns one row per Opportunity record and not one row per Company record. I can't do a simple group by Comp_Name on this either as we then get:

Msg 8120, Level 16, State 1, Line 3

Column 'Opportunity.Oppo_Opened' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Msg 8120, Level 16, State 1, Line 3

Column 'Opportunity.oppo_offerDate' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

And putting an aggregate method on either of the above I am presuming will screw up my median calculation.

My Question: How can I group by the PERCENTILE_CONT partition to get one row per Company (I should get 209 rows) as opposed to one row per Opportunity (north of 30k rows)?

Community
  • 1
  • 1
MagicalArmchair
  • 891
  • 1
  • 10
  • 24
  • 2
    Did you consider using the [accepted answer](http://stackoverflow.com/a/1567946/243373) in the SO thread you posted? The performance of PERCENTILE_CONT vs the accepted answer in the link [has been compared](http://sqlperformance.com/2012/08/t-sql-queries/median) and PERCENTILE_CONT was found to be poorly performing. – TT. Jan 22 '16 at 10:20
  • Also if you really want to find the median value using a PERCENTILE you should be using the [PERCENTILE_DISC](http://stackoverflow.com/a/25703108/243373) function. From the [documentation on PERCENTILE_CONT](https://msdn.microsoft.com/en-us/library/hh231473%28v=sql.110%29.aspx): `The result is interpolated and might not be equal to any of the specific values in the column.`. From [PERCENTILE_DISC documentation](https://msdn.microsoft.com/en-us/library/hh231327.aspx): `PERCENTILE_DISC (0.5) will compute the 50th percentile (that is, the median) of an expression.`. – TT. Jan 22 '16 at 10:27

1 Answers1

1

I've taken a different approach, which will allow me to highlight and explain how the median is calculated.

I should say that the previous question you looked at contains a better discussion on the performance and relative merits of different median calculations. This method probably isn't the most efficient. However I hope it will be easier to follow and then modify for your needs.

My example use the data defined below. There are two companies A and B. A has a median of 4 and B has a median of 5.

Sample Data

/* Using a table variable to create shared
 * sample data.
 */
DECLARE @Sample TABLE
    (
        Company     VARCHAR(1),
        [Value]     INT
    )
;

INSERT INTO @Sample
    (
        Company,
        [Value]
    )
VALUES
    ('A', 2),   -- Median value for Company A: {2, 4, 6} = 4.
    ('A', 4),       
    ('A', 6),
    ('B', 2),   -- Median records for Company B: {2, 4, 6, 8} = {4, 6}.
    ('B', 4),   -- Median value = {4, 6} / 2 = 5.   
    ('B', 6),       
    ('B', 8)
;

I've based my median calculation on this Wikipedia artical. For any set with an odd number of records I've taken the middle entry, when sorted by value. For sets with an even number of records I've taken the two middle records, again sorted by value, and averaged the result.

Examples

Odd

{3, 1, 2} = {1, 2, 3}        Sort.
{1, 2, 3} = 2                Find middle entry.
2                            Median.

Even

{4, 3, 1, 2} = {1, 2, 3, 4}  Sort.
{1, 2, 3, 4} = {2, 3}        Find middle values.
{2, 3} / 2 = 2.5             Average.
2.5                          Median.

Query

SELECT  
    r.Company,
    AVG(r.[Value])
FROM
    (
        /* You cannot use windowed functions directly in a WHERE clause.
         * This subquery makes those fields available to all clauses in the 
         * outer query.
         */
        SELECT
            ROW_NUMBER() OVER (PARTITION BY s.Company ORDER BY s.[Value])           AS RecordNumber,
            COUNT(*) OVER (PARTITION BY s.Company)                                  AS CompanyRecordCount,
            CAST((COUNT(*) OVER (PARTITION BY s.Company)) AS DECIMAL(9, 1)) / 2.0   AS MedianPoint,
            COUNT(*) OVER (PARTITION BY s.Company)  % 2                             AS IsOdd,
            s.Company,
            s.[Value]
        FROM
            @Sample AS s
    ) AS r
WHERE
    (   
        -- When company has odd number of records median is the middle record.
        r.IsOdd = 1
        AND r.RecordNumber = CEILING(MedianPoint)
    )       
    OR
    (
        -- When company has even number of records median is avg of two middle records.
        r.IsOdd = 0
        AND r.RecordNumber IN (MedianPoint, MedianPoint + 1)
    )
GROUP BY
    r.Company
;

I've used ROW_NUMBER to number each record, within each company, sorted by value (RecordNumber).

I've combined COUNT with the OVER clause to return the number of records for each company (CompanyRecordCount).

I've divided the number of records for each company by two, to find the median point (MedianPoint). We will use this to filter for the median records later. I have also calculated the remainder (IsOdd). This will be used to determine which calculation to apply.

The WHERE clause filters odd records for the CEILING of the MedianPoint. Ceiling rounds up to the nearest integer, and in the case of odd numbers this is always the record you require. If you have 3 records the median point is 1.5 and you want record #2.

For even records the WHERE clause returns the median record, and the next in the sequence. 4 records has a median point of 2. We want records 2 and 3.

Finally the outer query averages the result. In the case of odd records there is only value to average, so it remains unchanged. For evens there are two and the median is calculated dividing one by the other.

Community
  • 1
  • 1
David Rushton
  • 4,657
  • 1
  • 15
  • 30