7

I'm having trouble with calculating the median of a list of values, not the average.

I found this article Simple way to calculate median with MySQL

It has a reference to the following query which I don't understand properly.

SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2

If I have a time column and I want to calculate the median value, what do the x and y columns refer to?

JoSSte
  • 2,210
  • 5
  • 25
  • 40
Tim
  • 2,045
  • 4
  • 20
  • 28
  • Note that the solution you mention will not find the median if there are duplicate values. (It fails when the median itself has duplicates) – Mark Biesheuvel Nov 06 '13 at 14:40
  • 1
    I honestly don't understand how MySQL is used by millions of people and have been around for decades but doesn't have a function to calculate a median. Are there any other data-centric systems that haven't implemented math that is usually taught to 9–10 year olds in 4th grade? – Monica Heddneck Aug 19 '16 at 19:41

7 Answers7

10

I propose a faster way.

Get the row count:

SELECT CEIL(COUNT(*)/2) FROM data;

Then take the middle value in a sorted subquery:

SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;

I tested this with a 5x10e6 dataset of random numbers and it will find the median in under 10 seconds.

This will find an arbitrary percentile by replacing the COUNT(*)/2 with COUNT(*)*n where n is the percentile (.5 for median, .75 for 75th percentile, etc).

Reggie Edwards
  • 329
  • 3
  • 3
  • 4
    Good solution but if there is an odd number of items you should prob get the mean of the two middle points `SELECT avg(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue, @numvalues) x;` where @numvalues is `(@middlevalue mod 2) +1` – Ella Ryan Oct 31 '13 at 00:50
2

val is your time column, x and y are two references to the data table (you can write data AS x, data AS y).

EDIT: To avoid computing your sums twice, you can store the intermediate results.

CREATE TEMPORARY TABLE average_user_total_time 
      (SELECT SUM(time) AS time_taken 
            FROM scores 
            WHERE created_at >= '2010-10-10' 
                    and created_at <= '2010-11-11' 
            GROUP BY user_id);

Then you can compute median over these values which are in a named table.

EDIT: Temporary table won't work here. You could try using a regular table with "MEMORY" table type. Or just have your subquery that computes the values for the median twice in your query. Apart from this, I don't see another solution. This doesn't mean there isn't a better way, maybe somebody else will come with an idea.

Krab
  • 1,983
  • 12
  • 21
  • Thanks for that @Krab! Don't suppose you could help me with the following. SELECT AVG(time_taken) FROM ( SELECT SUM(`time`) AS time_taken FROM scores WHERE created_at >= '2010-10-10' and created_at <= '2010-11-11' GROUP BY user_id) AS average_user_total_time" ) to calculate the average of the totals of users scores but not sure how to apply the median formula to this query. Sorry for the re-post, timed out. – Tim Mar 23 '11 at 06:00
  • When I try that, I get "cannot reopen table x". Here is my total sql. CREATE TEMPORARY TABLE average_user_total_time (SELECT SUM(time) AS time_taken FROM scores WHERE created_at >= '2010-10-10' and created_at <= '2010-11-11' GROUP BY user_id); SELECT x.time_taken from average_user_total_time as x, average_user_total_time as y GROUP BY x.time_taken HAVING SUM(SIGN(1-SIGN(y.time_taken-x.time_taken))) = (COUNT(*)+1)/2 – Tim Mar 23 '11 at 23:25
1

Finding median in mysql using group_concat

Query:

SELECT
    IF(count%2=1,
       SUBSTRING_INDEX(substring_index(data_str,",",pos),",",-1),
       (SUBSTRING_INDEX(substring_index(data_str,",",pos),",",-1) 
         + SUBSTRING_INDEX(substring_index(data_str,",",pos+1),",",-1))/2) 
    as median 
FROM (SELECT group_concat(val order by val) data_str,
      CEILING(count(*)/2) pos,
      count(*) as count from data)temp;

Explanation:

Sorting is done using order by inside group_concat function

Position(pos) and Total number of elements (count) is identified. CEILING to identify position helps us to use substring_index function in the below steps.

Based on count, even or odd number of values is decided.

  • Odd values: Directly choose the element belonging to the pos using substring_index.
  • Even values: Find the element belonging to the pos and pos+1, then add them and divide by 2 to get the median.

Finally the median is calculated.

1

If you have a table R with a column named A, and you want the median of A, you can do as follows:

SELECT A FROM R R1
WHERE ( SELECT COUNT(A) FROM R R2 WHERE R2.A < R1.A ) = ( SELECT COUNT(A) FROM R R3 WHERE R3.A > R1.A )

Note: This will only work if there are no duplicated values in A. Also, null values are not allowed.

  • @nicholas-de-bin How does this work when there are an even number of rows in the column? Because, the traditional logic is for even numbers we need to return the average of the two numbers in the middle. This is not handled in the above query. Please correct me if wrong. – Amitrajit Bose Aug 17 '19 at 20:22
1

First try to understand what the median is: it is the middle value in the sorted list of values.

Once you understand that, the approach is two steps:

  1. sort the values in either order
  2. pick the middle value (if not an odd number of values, pick the average of the two middle values)

Example:

Median of 0 1 3 7 9 10: 5 (because (7+3)/2=5)
Median of 0 1 3 7 9 10 11: 7 (because 7 is the middle value)

So, to sort dates you need a numerical value; you can get their time stamp (as seconds elapsed from epoch) and use the definition of median.

Escualo
  • 36,702
  • 18
  • 79
  • 122
  • 1
    disagree on your first example: median is always an actual member of the set – zanlok Mar 24 '11 at 20:51
  • 4
    @zanlok: any "well accepted" software package will compute the median as I presented it (average value if even number of values) Matlab averages, R averages. What you are talking about is the "medoid", where the value is always a member of the data set. – Escualo Mar 25 '11 at 19:27
1

Simplest ways me and my friend have found out... ENJOY!!

SELECT count(*) INTO @c from station;
select ROUND((@c+1)/2) into @final; 
SELECT round(lat_n,4) from station a where @final-1=(select count(lat_n) from station b where b.lat_n > a.lat_n);
noam621
  • 2,518
  • 1
  • 13
  • 24
0

Here is a solution that is easy to understand. Just replace Your_Column and Your_Table as per your requirement.

SET @r = 0;

SELECT AVG(Your_Column)
FROM (SELECT (@r := @r + 1) AS r, Your_Column FROM Your_Table ORDER BY Your_Column) Temp
WHERE
    r = (SELECT CEIL(COUNT(*) / 2) FROM Your_Table) OR
    r = (SELECT FLOOR((COUNT(*) / 2) + 1) FROM Your_Table)

Originally adopted from this thread.

Amitrajit Bose
  • 478
  • 4
  • 13