0

I am using MySQL, and My SQL table looks like this:

sales_year (INT), sales_month (INT), sales_day (INT), price (float), customer_type (TEXT)

I'd like to know which sql_query would aggregate price data by quarter (compute the median price for each quarter, and how many observations where used to compute the median), and grouped by customer type.

I am struggling with two major steps: Median does not seem to be supported by mySQL, and also how to aggregate data by quarter - seems grouping by customer type it's very easy once those two are solved.

STRUGGLE - Computing the median....

I for example just tried creating a quarter column and it works but it computes the AVG instead of the median:

    select avg(price) as avg_price, floor(sales_month/3.0+1) as
    sales_quarter, count(*) as n_transactions, sales_year, customer_type
    from mydb.mytable
    group by sales_quarter, sales_year, customer_type;

This command works perfectly fine. But ideally I could change avg by MEDIAN but mySQL does not have such support, any suggestions on how to change this code to make it work for median purposes?

Note: I also tried installing my own median function from user defined functions in this site but the C code did not compile on my mac os X.

So the output would look like this:

sales_quarter (INT)
sales_year (INT)
median_price (FLOAT)
number_users_used_to_compute_median (INT)
customer_type (TEXT)
Dnaiel
  • 6,884
  • 21
  • 58
  • 113
  • There are some potential answers to the median part of your question [here](http://stackoverflow.com/questions/1291152/simple-way-to-calculate-median-with-mysql). The accepted answer appears to be outdated. Some of the others may be worth trying. – Patrick Q Aug 12 '14 at 15:11
  • @PatrickQ I am looking into it, thanks! seems complicated though, specially now that I already have a long code... – Dnaiel Aug 12 '14 at 15:47
  • FLOAT? Surely DECIMAL, and consider storing dates as dates. Also, what's your PRIMARY KEY? – Strawberry Aug 18 '14 at 12:12

4 Answers4

1

To get the median, you could try something like,

SELECT *
FROM table
LIMIT COUNT(*)/2, 1

This basically says: "Give me 1 item starting at the n/2th element, where n is the size of the set."

So, if you were doing it by quarter it would be the same thing with some GROUP BY quarter type things thrown in. Let me know if you'd like me to expand on this more.

Jean
  • 650
  • 1
  • 5
  • 14
1

Reference to velcrow's answer and post SqlFiddle here.

select quarter,
       group_concat(val order by row_number) ValSortString,
       floor((max(row_number) - min(row_number))/2)+1 as FirstPosition,
       ceil((max(row_number) - min(row_number))/2) +1 as SecondPosition,
       split_str(group_concat(val order by row_number),',',floor((max(row_number) - min(row_number))/2)+1) as FirstVal,
       split_str(group_concat(val order by row_number),',',ceil((max(row_number) - min(row_number))/2)+1) as SecondVal,
       (split_str(group_concat(val order by row_number),',',floor((max(row_number) - min(row_number))/2)+1) +
        split_str(group_concat(val order by row_number),',',ceil((max(row_number) - min(row_number))/2)+1) )/2 as Median
from (
      SELECT data.quarter,@rownum:=@rownum+1 as row_number, data.val,total_rows
             FROM data ,  
                  (select quarter,count(*) as total_rows from data group by quarter) as t,
                  (SELECT @rownum:=0) r
             where t.quarter = data.quarter
             order by data.quarter,val
     ) as b
group by quarter

This code only group by quarter, easy to expand other group by columns.

I use group_concat and split_str to simplify it only using one subquery.

So you have to create split_str function:

CREATE FUNCTION SPLIT_STR(
  x VARCHAR(255),
  delim VARCHAR(12),
  pos INT
)
RETURNS VARCHAR(255)
RETURN REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos),
       LENGTH(SUBSTRING_INDEX(x, delim, pos -1)) + 1),
       delim, '');

The problem is group_concat and split_str has limit. But this version can solve the problem with only on subquery and easy to understand.

UPDATE
According to Gordon Linoff's indicator, I add another solution without group_concat.

select quarter,
       floor((total_rows + 1)/2) as FirstPosition,
       ceil((total_rows + 1)/2) as SecondPosition,
       avg(val) as median
from (
      SELECT data.quarter,
             @rownum:= if (@q = data.quarter ,@rownum+1,if(@q := data.quarter, 1, 1) )as row_number, 
             data.val,
             total_rows
             FROM data ,  
                  (select quarter,count(*) as total_rows from data group by quarter) as t,
                  (SELECT @q := '', @rownum:=0) r
             where t.quarter = data.quarter
             order by data.quarter,val
     ) as b
where row_number in (floor((total_rows + 1)/2), ceil((total_rows + 1)/2))
group by quarter

And new Sql Fiddle here.

I'm a newbie to mysql, this question is easy to slove by MSSql, DB2 or Oracle, they all have Row_number()(Partition by ...).

I don't have enough reputation to comment on Gordon Linoff's answer, I have to thank him of learning how to implement row_number()(Partition by ...) function.

Community
  • 1
  • 1
Jaugar Chang
  • 3,116
  • 2
  • 11
  • 21
1

Oh, just call the average the median. People you talk to usually won't know the difference (;).

Okay, seriously, you can do this in MySQL. There is a method using group_concat() and substring_index(), but that runs the risk of overflowing the intermediate string values. Instead, enumerate the values and do simple arithmetic. For this, you need an enumeration and a total. The enumeration is:

  select t.*,
         @rn := if(@q = quarter and @y = @year and @ct = customer_type,
                   @rn + 1,
                   if(@q := quarter, if(@y := @year, if(@ct := customer_type, 1, 1), 1), 1)
                  ) as rn
  from mydb.mytable t cross join
       (select @q := '', @y := '', @ct := '', @rn := 0) vars
  order by sales_quarter, sales_year, customer_type, price;

This is carefully formulated. The order by columns correspond to the variables defined. There is only one statement that assigns variables in the select. The nested if() statements ensure that each variable gets set (using an and or or could result in short-circuiting). It is important to remember that MySQL does not guarantee the order of evaluation for expressions in the select, so having only one statement set variables is important to ensure correctness.

Now, getting the median is pretty easy. You need the total count, the sequential value (rn) and some arithmetic to handle the case where there are an even number of values:

select trn.sales_quarter, trn.sales_year, trn.customer_type, avg(price) as median
from (select t.*,
             @rn := if(@q = quarter and @y = @year and @ct = customer_type,
                       @rn + 1,
                       if(@q := quarter, if(@y := @year, if(@ct := customer_type, 1, 1), 1), 1)
                      ) as rn
      from mydb.mytable t cross join
           (select @q := '', @y := '', @ct := '', @rn := 0) vars
      order by sales_quarter, sales_year, customer_type, price
     ) trn join
     (select sales_quarter, sales_year, customer_type, count(*) as numrows
      from mydb.mytable t
      group by sales_quarter, sales_year, customer_type
     ) s
     on trn.sales_quarter = s.sales_quarter and
        trn.sales_year = s.sales_year and
        trn.customer_type = s.customer_type
where 2*rn in (numrows, numrows - 1, numrows + 1)
group by trn.sales_quarter, trn.sales_year, trn.customer_type;

Just to emphasize that the final average is not doing an average calculation. It is calculating the median. The normal definition is that for an even number of values, the median is the average of the two in the middle. The where clause handles both the even and odd cases.

Gordon Linoff
  • 1,122,135
  • 50
  • 484
  • 624
0

I know of two ways to do it. The first uses two selects and a join, the first select to get the values and rankings, and the second select to get the counts, then joins them. The second uses json functions to get everything in one select. They are both a little lengthy, but they work and are reasonably fast.

SOLUTION #1 (two selects and a join, one to get counts, one to get rankings)

SELECT  x.group_field, 
        avg(
            if( 
                x.rank - y.vol/2 BETWEEN 0 AND 1, 
                value_field, 
                null
            )
        ) as median
FROM (
    SELECT  group_field, value_field, 
            @r:= IF(@current=group_field, @r+1, 1) as rank, 
            @current:=group_field
    FROM (
        SELECT group_field, value_field
        FROM table_name
        ORDER BY group_field, value_field
    ) z, (SELECT @r:=0, @current:='') v
) x, (
    SELECT group_field, count(*) as vol 
    FROM table_name
    GROUP BY group_field
) y WHERE x.group_field = y.group_field
GROUP BY x.group_field;

SOLUTION #2 (uses a json object to store the counts and avoids the join)

SELECT group_field, 
    avg(
        if(
            rank - json_extract(@vols, path)/2 BETWEEN 0 AND 1,
            value_field,
            null
        )
    ) as median
FROM (
    SELECT group_field, value_field, path, 
        @rnk := if(@curr = group_field, @rnk+1, 1) as rank,
        @vols := json_set(
            @vols, 
            path, 
            coalesce(json_extract(@vols, path), 0) + 1
        ) as vols,
        @curr := group_field
    FROM (
        SELECT p.group_field, p.value_field, concat('$.', p.group_field) as path
        FROM table_name
        JOIN (SELECT @curr:='', @rnk:=1, @vols:=json_object()) v
        ORDER BY group_field, value_field DESC
    ) z
) y GROUP BY group_field;
Chris Strickland
  • 878
  • 1
  • 6
  • 16