Is the HAVING clause redundant?

Question

The following two queries yield the exact same result:

select country, count(organization) as N
from ismember
group by country
having N > 50;

select * from (
  select country, count(organization) as N
  from ismember
  group by country) x
where N > 50;

Can every HAVING clause be replaced by a sub-query and a WHERE clause like this? Or are there situations where a HAVING clause is absolutely necessary/more powerful/more efficient/whatever?

You should specify the RDBMS in the question, I guess. Your first query is not valid in SQL Server 2008, because you can't reference a SELECT alias in the HAVING clause, only in ORDER BY, because of the logical query processing order. – András Ottó, Aug 25 '12 at 10:23
I suspect MySQL? The first query is not valid in Oracle either, for the same reason. – Ben, Aug 25 '12 at 10:25
See [HAVING A Blunderful Time or Wish You Were WHERE](http://www.dcs.warwick.ac.uk/~hugh/TTM/HAVING-A-Blunderful-Time.html) – Martin Smith, Aug 25 '12 at 16:47
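
For anyone who wants to try the two forms side by side, here is a minimal sketch. It assumes SQLite driven through Python's standard `sqlite3` module, which is not one of the engines discussed in this thread, and the sample rows are invented. The HAVING form repeats the aggregate instead of using the alias `N`, since the comments point out that the alias is not accepted in HAVING by SQL Server or Oracle.

```python
import sqlite3

# Table and column names follow the question; the rows are invented
# purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ismember (country TEXT, organization TEXT)")
rows = [("A", f"org{i}") for i in range(60)]   # 60 memberships: above the threshold
rows += [("B", f"org{i}") for i in range(10)]  # 10 memberships: below the threshold
conn.executemany("INSERT INTO ismember VALUES (?, ?)", rows)

# HAVING form, repeating the aggregate rather than referencing the alias N.
having_rows = conn.execute(
    """
    SELECT country, COUNT(organization) AS N
    FROM ismember
    GROUP BY country
    HAVING COUNT(organization) > 50
    """
).fetchall()

# Sub-query form from the question: the alias N is visible to the outer WHERE.
subquery_rows = conn.execute(
    """
    SELECT * FROM (
      SELECT country, COUNT(organization) AS N
      FROM ismember
      GROUP BY country) x
    WHERE N > 50
    """
).fetchall()

print(having_rows)
print(subquery_rows)
```

On this sample both queries return the single row for country `A`. That shows the two forms agree here; it says nothing about how any particular engine executes them.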

Eugen Rieck · Accepted Answer · 2012-08-25T11:08:17.597

Two questions are asked here. The answer to the first is yes: the result set of a query with a HAVING clause is identical to the result set of the same query run as a subquery and filtered with a WHERE clause.

The second question is about performance and expressivity, and here we get heavily into implementation details. On MySQL there is a thin red line beyond which the performance starts to drift apart: the moment the result set of the inner query can no longer be held in memory. In that case, MySQL creates an on-disk representation of the inner query's result and then applies the WHERE filter to it. This does not happen if the HAVING clause is used: the disqualified groups are dropped from the result set.

This implies that the higher the selectivity of the HAVING clause, the more it matters for performance. Consider an inner query whose result set of a million rows is reduced by the HAVING clause to 5 rows: it is very likely that the inner result set could not be held in memory, but very likely that the final result set could.
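
To make the selectivity argument concrete, here is a small sketch that only counts rows. It assumes SQLite via Python's `sqlite3` module and invented data, and it does not reproduce or measure MySQL's in-memory versus on-disk behaviour. It just shows how much larger the grouped intermediate result can be than the filtered one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ismember (country TEXT, organization TEXT)")

# Invented data: 1000 countries with a single membership each,
# plus 5 countries with 60 memberships each.
rows = [(f"c{c}", "org0") for c in range(1000)]
rows += [(f"big{c}", f"org{i}") for c in range(5) for i in range(60)]
conn.executemany("INSERT INTO ismember VALUES (?, ?)", rows)

grouped = (
    "SELECT country, COUNT(organization) AS N "
    "FROM ismember GROUP BY country"
)

# Number of groups the inner query of the sub-query form produces.
inner_count = conn.execute(
    f"SELECT COUNT(*) FROM ({grouped}) x"
).fetchone()[0]

# Number of groups that survive the filter.
final_count = conn.execute(
    f"SELECT COUNT(*) FROM ({grouped} HAVING COUNT(organization) > 50) x"
).fetchone()[0]

print(inner_count, final_count)
```

Here the grouped intermediate result has 1005 rows while only 5 pass the filter. That gap is what the argument above is about. Whether an engine actually materializes the larger set depends on its optimizer.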

Edit

I had this once: the query selected the few outliers from a very evenly distributed table (number of pieces produced per day on a physical machine in a workshop). I investigated because of the high I/O load.

Edit 2

Please keep in mind that the query cache is not used for subqueries (IMHO a place development should focus on more), so the subquery pattern will not profit from the inner query being a cached result set.

score 8 · Answer 2 · answered Aug 25 '12 at 10:41

In SQL Server 2008, two similar queries have exactly the same execution plan:

[Image: execution plans of the two queries]

I've also studied a lot of queries generated by Entity Framework (with SS 2008), and so far I have never seen a query with a HAVING clause. Grouping queries with a condition on an aggregated result are always translated into a query with a subquery. I trust the ADO.NET team knows what they're doing...

answered Aug 25 '12 at 10:41

Gert Arnold

I wouldn't trust that at all. EF (and Linq-to-SQL) produce notoriously bad queries. – Rob Farley Aug 25 '12 at 10:46
@RobFarley I know that they can't compete with manually crafted and optimized queries, but for automated queries they're not that bad. You should know some dos and don'ts when writing LINQ, though. – Gert Arnold Aug 25 '12 at 10:49
Maybe they are similar because SQL Server converted the subquery version to the aggregate version of the query? ツ – Michael Buen Aug 25 '12 at 11:34

score 4 · Answer 3 · answered Aug 25 '12 at 10:34

The HAVING clause is very useful to avoid the added complexity of sub-queries. However, the two are logically equivalent and every HAVING clause can be rewritten using a sub-query as you have.

In case you're curious, you could also write every WHERE clause as a HAVING clause if you're prepared to take GROUP BY to the extreme.

answered Aug 25 '12 at 10:34

Rob Farley

Not sure your last line is true, is it? Suppose a table with one column called `number` and three rows, `VALUES (1),(1),(2)`. How can you simulate `SELECT number FROM T WHERE number = 1` with `HAVING`? – Martin Smith Aug 25 '12 at 16:58
That would only return one row. – Martin Smith Aug 25 '12 at 22:44
Oh, sorry - misread the list of numbers (went over two lines on my mobile). You could introduce a differentiator like a row_number, and include that in the group expression. If you group by something that's unique, HAVING and WHERE become equivalent. – Rob Farley Aug 26 '12 at 08:08
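
The exchange above can be replayed on Martin Smith's three-row table. This is a sketch under assumptions: it uses SQLite via Python's `sqlite3` module, and SQLite's implicit `rowid` column stands in for the `row_number` differentiator Rob Farley suggests, because it is unique per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (number INTEGER)")
conn.executemany("INSERT INTO T VALUES (?)", [(1,), (1,), (2,)])

# The query to be simulated: returns both rows where number = 1.
where_rows = conn.execute(
    "SELECT number FROM T WHERE number = 1"
).fetchall()

# Grouping by the column alone collapses the duplicates into one group.
grouped_only = conn.execute(
    "SELECT number FROM T GROUP BY number HAVING number = 1"
).fetchall()

# Adding a unique differentiator (here SQLite's rowid) makes every row
# its own group, so HAVING filters row by row, just like WHERE.
grouped_unique = conn.execute(
    "SELECT number FROM T GROUP BY rowid, number HAVING number = 1"
).fetchall()

print(where_rows, grouped_only, grouped_unique)
```

Grouping by `number` alone returns one row, which is the objection raised. Grouping by a unique key as well returns the same two rows as the WHERE version.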

score 1 · Answer 4 · answered Aug 25 '12 at 10:41

I know that you changed it from general to MySQL, but I would like to add a (maybe useful) note here. With a little modification, I tried your query in SQL Server 2008.

For anybody who wants more detail: the execution plans of the two queries are exactly the same in SQL Server 2008. So the optimizer processes the two commands in the same way, with the same performance and estimates.

score 0 · Answer 5 · edited Aug 25 '12 at 10:28

IMHO, using the HAVING clause should be more efficient, because in the second case there would be an additional pass over the worktable that contains the grouped results, on top of which the filtering criterion is run.

edited Aug 25 '12 at 10:28

Himanshu Jansari

answered Aug 25 '12 at 10:27

Vikdor

Sub-queries don't get expanded into worktables. The two queries (albeit with the alias problem removed for other platforms) should be treated identically. – Rob Farley Aug 25 '12 at 10:31
@RobFarley This is not entirely true: If the result set exceeds a certain size, it will be materialized. – Eugen Rieck Aug 25 '12 at 10:33
Ok. Not in SQL Server or Oracle. Those systems will simplify the query out. – Rob Farley Aug 25 '12 at 10:35
I was specifically talking MySQL - should have made this clearer. Sorry for that. – Eugen Rieck Aug 25 '12 at 10:36

score 0 · Answer 6 · answered Aug 25 '12 at 14:36

Logically, yes, the result will be the same in the end. But performance might differ: the HAVING clause might lead the DB to choose a different execution plan.

A note to the guys above (I can't comment directly, somehow): the execution plan does not depend only on your query. The DB might also adjust it at runtime depending on statistics such as table size. That holds for DB2, at least...
