0

I have a table called Graduates recording name and income for different graduates. Now I need to calculate the median of income. Here is the code from a book.

My question is

  1. What is the result from having clause?
  2. What is the result from self join ?
SELECT AVG(DISTINCT income)
FROM (
       SELECT T1.income
       FROM Graduates T1, Graduates T2
       GROUP BY T1.income
       HAVING SUM(CASE WHEN T2.income >= T1.income THEN 1 ELSE 0 END) 
                   >= COUNT(*) / 2 
           AND SUM(CASE WHEN T2.income <= T1.income THEN 1 ELSE 0 END) 
                   >= COUNT(*) / 2 
       ) TMP;

goodareny
  • 21
  • 3
  • 1
    I honestly don't know if this approach will work. SUGGESTION: See if there's a DB-specific "median()" function available to you. Look here for alternatives: [MSSQL](https://stackoverflow.com/questions/1342898), [MySql](https://mariadb.com/kb/en/median/), [Oracle](https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions086.htm), [PostgreSQL](https://wiki.postgresql.org/wiki/Aggregate_Median) – FoggyDay Jun 05 '20 at 21:12
  • This code runs correctly in PostgreSQL. Thx. I will try. But I also wanna look for a method applied to different sql language. – goodareny Jun 05 '20 at 21:16
  • 1
    Q: Does it run correctly ... or does it merely *appear* to run correctly, based on the test cases you happened to try? I don't know :( Look [here](https://www.red-gate.com/simple-talk/sql/t-sql-programming/calculating-the-median-value-within-a-partitioned-set-using-t-sql) for more details on some of the subtleties of trying to compute a median. To your original question: the "having" and the "self join" is just a way to partition the values from which you determine a "median". – FoggyDay Jun 05 '20 at 21:19
  • I removed the inconsistent database tags. Please tag only with the database you are using. In addition, this is about the worse code I've seen for calculating a median. You might want to get another book. – Gordon Linoff Jun 05 '20 at 21:20
  • This code tries to find the overlap of two unions from what it shows in having clause. But I really can't understand how it works. Anyway, this has a correct result. – goodareny Jun 05 '20 at 21:28

1 Answers1

2

What the code is doing is finding the one or two middle values. It is doing this by counting the number of values bigger than and less than the value.

Each of the SUM()s in the HAVING clause is counting the number of values greater than or less than a given income. What the expression is doing is saying something to the effect:

The middle value value(s) are the ones that have the same number of values bigger and less than itself.

The median is then the average of the middle values. If there is one, then the average is the value itself. If there are two the average is the median.

This is awful for multiple reasons:

  • It only works on numeric values. But medians are defined for strings and dates as well.
  • It requires a self-join, which is expensive.
  • It is rather indecipherable.
  • It is not obvious how to get medians within groups rather than over the entire dataset.
Gordon Linoff
  • 1,122,135
  • 50
  • 484
  • 624
  • @Gordon Linoff: excellent summary. 1) I agree: "This is awful for multiple reasons". 2) The OP wanted to understand the code she cited. I tried to explain, but you did a much better job. Kudos! 3) The OP would like a vendor-agnostic, pure-SQL solution. That's reasonable ... but using vendor-specific "Median()" functions is probably safer, and certainly easier and more efficient. 4) For PostgreSQL, I'd recommend [Aggregate Median](https://wiki.postgresql.org/wiki/Aggregate_Median). – FoggyDay Jun 05 '20 at 21:47
  • @FoggyDay . . . I think `percentile_cont()` is standard SQL and not vendor-specific. – Gordon Linoff Jun 06 '20 at 00:34