1

I am trying to calculate median of time spent by people on a specific category. The whole dataset I have is around 500k rows but I have tried to summarize a snippet of it below

person category time spent (in mins)
roger  dota 20
jim    dota 50
joe    call of duty 5
jim    fallout 25
kathy  GTA 40
alicia fallout 100

I have tried to use the query below but I am getting no where.

SELECT x1.person, x1.time spent 
from data x1, data x2
GROUP BY x1.val
HAVING SUM(SIGN(1-SIGN(x2.val-x1.val))) = (COUNT(*)+1)/2
ash
  • 701
  • 1
  • 8
  • 20
  • 1
    Possible duplicate of [Simple way to calculate median with MySQL](http://stackoverflow.com/questions/1291152/simple-way-to-calculate-median-with-mysql) – maraca Nov 17 '15 at 23:21
  • @maraca I tried the thread already and my query above is a mod of that. That's taking forever. So posted a new thread. – ash Nov 17 '15 at 23:26
  • Your implementation is not really the accepted solution, but the 2nd highest answer as far as I can see, the highest rated answer even talks about your problems. – maraca Nov 17 '15 at 23:29
  • @maraca I tried the most accepted soultion but I don't have where clauses in my problem. So when I remove the where clauses, the logic baffles. I am trying to fix that but am looking for an alternative solution on the side. – ash Nov 17 '15 at 23:41

1 Answers1

1

A self-join on 500,000 rows is likely to be expensive. Why not just enumerate the rows and grab the one in the middle?

select d.*
from (select d.*, (@rn := @rn + 1) as rn
      from data d cross join
           (select @rn := 0) params
      order by d.val
     ) d
where 2*rn in (@rn, @rn + 1);

The weird where clause chooses the value in the middle -- it is just an approximation if there are an eve number of rows. Because you want the actual row values, you need the approximation. The normal calculation of the median itself would be:

select avg(d.val)
from (select d.*, (@rn := @rn + 1) as rn
      from data d cross join
           (select @rn := 0) params
      order by d.val
     ) d
where 2*rn in (@rn - 1, @rn, @rn + 1);

EDIT:

The same logic works per person as well, but with a bit more logic to get the overall counts:

select d.person, avg(val) as median
from (select d.*,
             (@rn := if(@p = person, @rn + 1
                        if(@p := person, 1, 1)
             ) as rn
      from data d cross join
           (select @rn := 0, @p := '') params
      order by person, d.val
     ) d join
     (select person, count(*) as cnt
      from data
      group by person
     ) p
     on d.person = p.person
where 2*rn in (d.cnt - 1, d.cnt, d.cnt + 1)
group by person;
Gordon Linoff
  • 1,122,135
  • 50
  • 484
  • 624
  • thanks for the answer. I am trying to group b person as well, meaning I am looking for median of values per person – ash Nov 17 '15 at 23:25