-1

There's no median function in sql server, so I'm using this wonderful suggestion:

https://stackoverflow.com/a/2026609/117700

this computes the median over an entire dataset, but I need the median per record.

My dataset is:

+-----------+-------------+
| client_id | TimesTested |
+-----------+-------------+
|    214220 |           1 |
|    215425 |           1 |
|    212839 |           4 |
|    215249 |           1 |
|    210498 |           3 |
|    110655 |           1 |
|    110655 |           1 |
|    110655 |          12 |
|    215425 |           4 |
|    100196 |           1 |
|    110032 |           1 |
|    110032 |           1 |
|    101944 |           3 |
|    101232 |           2 |
|    101232 |           1 |
+-----------+-------------+

here's the query I am using:

select client_id,  
    (
    SELECT
    (
     (SELECT MAX(TimesTested ) FROM
       (SELECT TOP 50 PERCENT t.TimesTested 
       FROM counted3 t 
       where t.timestested>1 
       and CLIENT_ID=t.CLIENT_ID 
       ORDER BY t.TimesTested ) AS BottomHalf)
     +
     (SELECT MIN(TimesTested ) FROM
       (SELECT TOP 50 PERCENT t.TimesTested 
       FROM counted3 t 
       where t.timestested>1 
       and CLIENT_ID=t.CLIENT_ID 
       ORDER BY t.TimesTested DESC) AS TopHalf)
    ) / 2 AS Median
    ) TotalAvgTestFreq
from counted3 

group by client_id

but it is giving my funny data:

+-----------+------------------+
| client_id | median???????????|
+-----------+------------------+
|    100007 |               84 |
|    100008 |               84 |
|    100011 |               84 |
|    100014 |               84 |
|    100026 |               84 |
|    100027 |               84 |
|    100028 |               84 |
|    100029 |               84 |
|    100042 |               84 |
|    100043 |               84 |
|    100071 |               84 |
|    100072 |               84 |
|    100074 |               84 |
+-----------+------------------+

i can i get the median for every client_id ?

I am currently trying to use this awesome query from Aaron's site:

select c3.client_id,(
    SELECT AVG(1.0 * TimesTested ) median
    FROM
    (
        SELECT o.TimesTested , 
        rn = ROW_NUMBER() OVER (ORDER BY o.TimesTested ), c.c
        FROM counted3 AS o
        CROSS JOIN (SELECT c = COUNT(*) FROM counted3) AS c
        where count>1
    ) AS x
    WHERE rn IN ((c + 1)/2, (c + 2)/2)
    ) a
    from counted3 c3
    group by c3.client_id

unfortunately, as Richardthekiwi points out:

it's for a single median whereas this question is about a median per-partition

i would like to know how i can join it on counted3 to get the median per partition?>

Community
  • 1
  • 1
Alex Gordon
  • 51,480
  • 273
  • 609
  • 976
  • What do you mean, "median per record"? – Lamak Oct 29 '12 at 18:02
  • @Lamak i mean for all the values for client_id 100007 i want to get the median – Alex Gordon Oct 29 '12 at 18:02
  • if 100007 has timestested 1, 1,1,1,2,4,5,7,8,99 then the median woudl be 4 – Alex Gordon Oct 29 '12 at 18:03
  • Ok, Then you should update your dataset since it shows unique `client_id`s and they are different from your result set – Lamak Oct 29 '12 at 18:05
  • There are many ways, I profiled the fastest here (hint: it's not the approach you're using): http://www.sqlperformance.com/2012/08/t-sql-queries/median – Aaron Bertrand Oct 29 '12 at 18:13
  • On the inner most queries, you say: ` and CLIENT_ID=t.CLIENT_ID ` which is the same as ` and t.CLIENT_ID=t.CLIENT_ID ` - in other words, a non-sensical join condition. You need to reference the outer table as Sebastien has indicated. Also, why not include sample data in your question that actually matches your results? Why are the client_ids not the same? – Aaron Bertrand Oct 29 '12 at 18:21
  • @AaronBertrand - I thought op said that he already fixed the sample and results data, but alas, I didn't check – Lamak Oct 29 '12 at 18:23
  • Nope. The lowest `client_id` in your "dataset" is 100196. The highest `client_id` in your "funny data" is 100074. – Aaron Bertrand Oct 29 '12 at 18:27
  • @AaronBertrand i used the query on your site by the israeli guy. it's very fast, but returned ALL NULLS – Alex Gordon Oct 29 '12 at 19:16
  • this one SELECT @Median = AVG(1.0 * val) FROM ( SELECT o.val, rn = ROW_NUMBER() OVER (ORDER BY o.val), c.c FROM dbo.EvenRows AS o CROSS JOIN (SELECT c = COUNT(*) FROM dbo.EvenRows) AS c ) AS x WHERE rn IN ((c + 1)/2, (c + 2)/2); – Alex Gordon Oct 29 '12 at 19:19
  • 1
    @Aaron - while your blog is interesting, it's for a single median whereas this question is about a median *per-partition* (client_id). – RichardTheKiwi Oct 29 '12 at 19:20
  • 1
    @RichardTheKiwi while true, you can't just take the code from the blog, I expect people who are programming for a living to be able to extend what we've done for them to some degree. You'll also note that I didn't post it as an *answer*. Therefore I was not stating or even implying that I had code to provide a complete answer - just something to look at, not to be nit-picked. – Aaron Bertrand Oct 29 '12 at 19:39

2 Answers2

1

try this:

select client_id,  
    (
    SELECT
    (
     (SELECT MAX(testfreq) FROM
       (SELECT TOP 50 PERCENT t.testfreq 
       FROM counted3 t 
       where t.timestested>1 
       and c3.CLIENT_ID=t.CLIENT_ID 
       ORDER BY t.testfreq) AS BottomHalf)
     +
     (SELECT MIN(testfreq) FROM
       (SELECT TOP 50 PERCENT t.testfreq 
       FROM counted3 t 
       where t.timestested>1 
       and c3.CLIENT_ID=t.CLIENT_ID 
       ORDER BY t.testfreq DESC) AS TopHalf)
    ) / 2 AS Median
    ) TotalAvgTestFreq
from counted3 c3

group by client_id

I added the c3 alias to the outer CLIENT_ID references and the outer table.

Sebastian Meine
  • 10,331
  • 23
  • 37
  • thank you very much that's probably what the problem was but unfortunately this query has been running for liek 15 min so far – Alex Gordon Oct 29 '12 at 18:29
1

Note: If testFreq is an int or bigint type, you need to CAST it before taking an average, otherwise you'll get integer division, e.g. (2+5)/2 => 3 if 2 and 5 are the median records - e.g. AVG(Cast(testfreq as float)).

select client_id, avg(testfreq) median_testfreq
from
(
    select client_id,
           testfreq,
           rn=row_number() over (partition by CLIENT_ID
                                 order by testfreq),
           c=count(testfreq) over (partition by CLIENT_ID)
    from tbk
    where timestested>1
) g
where rn in (round(c/2,0),c/2+1)
group by client_id;

The median is found either as the central record in an ODD number of rows, or the average of the two central records in an EVEN number of rows. This is handled by the condition rn in (round(c/2,0),c/2+1) which picks either the one or two records required.

RichardTheKiwi
  • 99,095
  • 24
  • 183
  • 252