
I am performing some tests on an HSQLDB server with a table containing 500 000 entries. The table has no indices. There are 5 000 distinct business keys, and I need a list of them. Naturally I started with a DISTINCT query:

SELECT DISTINCT business_key FROM memory WHERE
   concept <> 'case' or 
   attrib <> 'status' or 
   value <> 'closed'

It takes around 90 seconds!!!

Then I tried using GROUP BY:

SELECT business_key FROM memory WHERE
       concept <> 'case' or 
       attrib <> 'status' or 
       value <> 'closed'
GROUP BY business_key

And it takes 1 second!!!

Trying to figure out the difference, I ran EXPLAIN PLAN FOR, but it seems to give the same information for both queries.

EXPLAIN PLAN FOR DISTINCT ...

isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

EXPLAIN PLAN FOR SELECT ... GROUP BY ...

isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
groupColumns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

EDIT: I did additional tests. With 500 000 records in HSQLDB, all with distinct business keys, the performance of DISTINCT is now better: 3 seconds, versus around 9 seconds for GROUP BY.

In MySQL both queries perform about the same:

MySQL, 500 000 rows, 5 000 distinct business keys: both queries take 0.5 second.
MySQL, 500 000 rows, all distinct business keys: SELECT DISTINCT ... takes 11 seconds; SELECT ... GROUP BY business_key takes 13 seconds.

So the problem is only related to HSQLDB.

I will be very grateful if someone can explain why there is such a drastic difference.

  • please show the result of `EXPLAIN PLAN` AND try running the `DISTINCT` query AFTER you run the `GROUP BY` to see whether perhaps some caching is skewing the timing... – Yahia Oct 30 '11 at 08:31
  • Given you get the same plan for each query, it sounds like either the table data or the result has been cached. – a'r Oct 30 '11 at 08:32
  • I ran them so many times that I believe caching is not an issue. I am posting the `EXPLAIN PLAN FOR` output. – Martin Dimitrov Oct 30 '11 at 08:36
  • I have an idea, but I am really not sure - please try `SELECT DISTINCT business_key FROM (SELECT business_key FROM memory WHERE concept <> 'case' or attrib <> 'status' or value <> 'closed')` - this ought to show the same performance you see with the `GROUP BY` IF my idea is right. – Yahia Oct 30 '11 at 09:08
  • @Yahia: still very slow - 94 seconds. I will run the same queries in MySQL to see what will show – Martin Dimitrov Oct 30 '11 at 09:37
  • That's not what pinal dave says. http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/ – brumScouse Sep 11 '15 at 16:20

1 Answer


The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:

  • Copy all business_key values to a temporary table
  • Sort the temporary table
  • Scan the temporary table, returning each item that is different from the one before it

The group by could be executed like:

  • Scan the full table, storing each value of business key in a hashtable
  • Return the keys of the hashtable

The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.

Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.
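
To make the two strategies concrete, here is a small, self-contained Java sketch using plain JDK collections. It is only an illustration of the guessed plans above, not HSQLDB's actual implementation; the class name, method names, and the generated test keys are invented for the example.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-JDK illustration of the two strategies guessed at above.
// This is NOT HSQLDB's actual code, just the sort-based vs. hash-based idea.
public class DistinctVsGroupBySketch {

    // Guessed DISTINCT plan: copy the keys to a "temporary table",
    // sort it, then emit every value that differs from its predecessor.
    static List<String> sortBasedDistinct(List<String> keys) {
        List<String> temp = new ArrayList<>(keys);
        Collections.sort(temp);
        List<String> result = new ArrayList<>();
        String previous = null;
        for (String key : temp) {
            if (!key.equals(previous)) {   // value changed => new distinct key
                result.add(key);
                previous = key;
            }
        }
        return result;
    }

    // Guessed GROUP BY plan: a single scan that stores every key
    // in a hash table and returns the table's key set.
    static Set<String> hashBasedDistinct(List<String> keys) {
        Set<String> seen = new HashSet<>();
        for (String key : keys) {
            seen.add(key);                 // duplicates are simply ignored
        }
        return seen;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for 500 000 rows with 5 000 distinct keys.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 500_000; i++) {
            keys.add("key-" + (i % 5_000));
        }
        System.out.println(sortBasedDistinct(keys).size());  // prints 5000
        System.out.println(hashBasedDistinct(keys).size());  // prints 5000
    }
}

With only 5 000 distinct keys the hash table stays small, so the single-pass approach is cheap; when every key is distinct, both the sorted copy and the hash table end up holding all 500 000 values, and the balance can tip the other way, which is consistent with the numbers in the edit above.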

  • Thanks for the reply. Are your guesses evident from the `EXPLAIN` output? Both look the same to me. – Martin Dimitrov Oct 30 '11 at 10:54
  • As far as I can see, the plan doesn't specify how it will execute the join. I'm not even sure why it would execute a join. It probably takes an HSQLDB specialist to read the explain output. – Andomar Oct 30 '11 at 11:03
  • As the answer indicates, the second method uses more memory and may hit garbage collection (GC) too often. If you increase the JVM memory allocation, there shouldn't be a huge difference between the two query times. – fredt Oct 30 '11 at 21:23
  • I did an additional test by entering all distinct keys in the table (see above). Do you think the result proves your point? Thanks a lot. – Martin Dimitrov Oct 31 '11 at 07:07
  • There are too many variables in database optimization to really prove anything. However, it seems consistent in the sense that the first approach would be relatively faster when all keys are different. – Andomar Oct 31 '11 at 07:47
  • Fair enough. Thanks for the help. – Martin Dimitrov Oct 31 '11 at 10:04
  • With 200,000 records, GROUP BY takes 500 ms and DISTINCT takes 60 ms. – Hossein Oct 03 '13 at 20:50
  • Can an SME/expert please explain this in more detail with examples... I've had this issue many times but don't seem to get around it... I know the fix but I want to know how and WHY – singhswat Dec 03 '15 at 13:08