7

I have the following MySQL query which is working perfectly:

select 
    count(*) as `# of Data points`, 
    name, 
    max((QNTY_Sell/QNTYDelivered)*1000) as `MAX Thousand Price`,
    min((QNTY_Sell/QNTYDelivered)*1000) as `MIN Thousand Price`,
    avg((QNTY_Sell/QNTYDelivered)*1000) as `MEAN Thousand Price` 
from 
    table_name 
where 
    year(date) >= 2012 and 
    name like "%the_name%" and 
    QNTYDelivered > 0 and 
    QNTY_Sell > 0 
group by name 
order by name;

Now I wish to also add a result column that gives me the MEDIAN of the data for each line. Under SELECT this would look like this in a perfect world:

median((QNTY_Sell/QNTYDelivered)*1000) as `MEDIAN Thousand Price`

Searching Google for a MySQL median function brought me to this answer, which seems ok if you are interested in the median of a data set for an entire table: Simple way to calculate median with MySQL

The difference here is that I am grouping the data in my table by the name column, and want to get the median for each line of the data grouped by this column.

Does anyone know a nifty trick to allow me to do this?

Thanks!

Community
  • 1
  • 1
jeffery_the_wind
  • 13,565
  • 31
  • 87
  • 146
  • there is no median function in mysql – Alison S Oct 29 '13 at 09:33
  • 3
    Does it have to be in the same query? If you can run a second query you can calculate the midpoint of the set now that you know the number of data points. The midpoint is either one or two rows. Run the same query but add LIMIT [midpoint], [midpoint mod 2] and return the average. – Ella Ryan Oct 30 '13 at 06:50
  • You'll have to change the order by if you do the above - I assume you want the median of the QNTY_Sell/QNTYDelivered so you'd have to order on that to find the midpoint. – nickL Jan 22 '14 at 18:50

2 Answers2

3

The only way I found to do this is through string manipulation:
with GROUP_CONCAT a list of all value is created then with indented SUBSTRING_INDEX the median value is taken

SELECT
    count(*) AS `# of Data points`,
    name,
    max((QNTY_Sell/QNTYDelivered)*1000) AS `MAX Thousand Price`,
    min((QNTY_Sell/QNTYDelivered)*1000) AS `MIN Thousand Price`,
    avg((QNTY_Sell/QNTYDelivered)*1000) AS `MEAN Thousand Price`
  , CASE (count(*) % 2)
    WHEN 1 THEN SUBSTRING_INDEX(
      SUBSTRING_INDEX(
        group_concat((QNTY_Sell/QNTYDelivered)*1000 
                      ORDER BY (QNTY_Sell/QNTYDelivered)*1000 SEPARATOR ',')
      , ',', (count(*) + 1) / 2)
    , ',', -1)
    ELSE (SUBSTRING_INDEX(
      SUBSTRING_INDEX(
        group_concat((QNTY_Sell/QNTYDelivered)*1000 
                      ORDER BY (QNTY_Sell/QNTYDelivered)*1000 SEPARATOR ',')
      , ',', count(*) / 2)
    , ',', -1)
  + SUBSTRING_INDEX(
      SUBSTRING_INDEX(
        group_concat((QNTY_Sell/QNTYDelivered)*1000 
                      ORDER BY (QNTY_Sell/QNTYDelivered)*1000 SEPARATOR ',')
      , ',', (count(*) + 1) / 2)
    , ',', -1)) / 2
    END median
FROM
    sales
WHERE
    year(date) >= 2012 AND
    name LIKE "%art.%" AND
    QNTYDelivered > 0 AND
    QNTY_Sell > 0
GROUP BY name
ORDER BY name;  

The CASE is needed to check if we have a single median value, with an odd number of values, or two median values, with an even number of values, in the second case the median is the mean of the two values founded.

SQLFiddle

Serpiton
  • 3,510
  • 3
  • 22
  • 34
  • Looking at this beast of a solution I really - like, REALLY - wonder why SQL platforms collectively refuse to implement Median and Mode functions. Having AVG, MIN, MAX but not the other two is quite half-arsed... – Tomm Jun 13 '14 at 10:16
  • Don't get me wrong, I applaud you for your solution. But it feels kind of ridiculous that we have to come up with queries the size of Dostjewski's collected works only to get the bloody median. I ended up here because I need to get the median and mode in the context of several nested subqueries; and the realisation that it is so difficult to get something as seemingly trivial as the Median is slightly annoying. – Tomm Jun 13 '14 at 10:44
  • Hey, I implemented your solution but noticed that the Median of an even number of values is 'off by one unit', i.e. in an array of 10 values, your solution places the Median between the 4th and 5th value, rather than the 5th and 6th. I fixed this by using `(count(*) / 2) + 1` in the two `SUBSTRING_INDEX` calls of the `ELSE` statement. – Tomm Jun 13 '14 at 13:18
3

You can calculate the median with GROUP BY in MySQL even though there is no median function built in.

Consider the table:

Acrington   200.00
Acrington   200.00
Acrington   300.00
Acrington   400.00
Bulingdon   200.00
Bulingdon   300.00
Bulingdon   400.00
Bulingdon   500.00
Cardington  100.00
Cardington  149.00
Cardington  151.00
Cardington  300.00
Cardington  300.00

For each row you can count the number of similar items that are less. You can also count how many values are less than or equal:

name        v       <   <=
Acrington   200.00  0   2
Acrington   200.00  0   2
Acrington   300.00  2   3
Acrington   400.00  3   4
Bulingdon   200.00  0   1
Bulingdon   300.00  1   2
Bulingdon   400.00  2   3
Bulingdon   500.00  3   4
Cardington  100.00  0   1
Cardington  149.00  1   2
Cardington  151.00  2   3
Cardington  300.00  3   5
Cardington  300.00  3   5

With query

SELECT name,v, (SELECT COUNT(1) FROM sale WHERE v<o.v AND name=o.name) as ls
             , (SELECT COUNT(1) FROM sale WHERE v<=o.v AND name=o.name) as lse
  FROM sale o

The median value will occur when the less-than-or-equal count is half the number of items

  • Acrington has 4 items. Half of this is 2 which is in the range 0..2 (corresponding to 200.00) and also in the range 2..3 (corresponding to 300.00)

  • Bullingdon also has 4 items. 2 is in the range 1..2 (value 300.00) and 2..3 (value 400.00)

  • Cardington has 5 items. The value 2.5 is between 2 and 3 which corresponds to Cardington 151.

The median value is the mean of the min and max values returned by:

SELECT cs.name,v
   FROM
   (SELECT name,v, (SELECT COUNT(1) FROM sale WHERE v<o.v AND name=o.name) as ls
                 , (SELECT COUNT(1) FROM sale WHERE v<=o.v AND name=o.name) as lse
      FROM sale o) cs JOIN
   (SELECT name,COUNT(1)*.5 as cn
      FROM sale
      GROUP BY name) cc ON cs.name=cc.name
 WHERE cn between ls and lse

Which gives:

Acrington   200.00
Acrington   200.00
Acrington   300.00
Bulingdon   300.00
Bulingdon   400.00
Cardington  151.00

Finally we can get the median:

SELECT name,(MAX(v)+MIN(v))/2 FROM
(SELECT cs.name,v
   FROM
   (SELECT name,v, (SELECT COUNT(1) FROM sale WHERE v<o.v AND name=o.name) as ls
                 , (SELECT COUNT(1) FROM sale WHERE v<=o.v AND name=o.name) as lse
      FROM sale o) cs JOIN
   (SELECT name,COUNT(1)*.5 as cn
      FROM sale
     GROUP BY name) cc ON cs.name=cc.name
 WHERE cn between ls and lse
 ) AS medians
GROUP BY name

Giving

Acrington   250.000000
Bulingdon   350.000000
Cardington  151.000000
SQL Hacks
  • 1,271
  • 1
  • 10
  • 15
  • doesn't this result in quadratic runtimes? You are effectively asking the query to compare every value against every other value. – steveo'america May 31 '19 at 00:00