The query is basically:

SELECT DISTINCT "my_table"."foo" FROM "my_table" WHERE...

Pretend that I'm 100% certain the DISTINCT portion of the query is the reason it runs slowly. I've omitted the rest of the query to avoid confusion, since it's the DISTINCT portion's slowness that I'm primarily concerned with (DISTINCT is often a source of slowness).

The table in question has 2.5 million rows of data. The DISTINCT is needed for purposes not listed here (because I don't want back a modified query, but rather just general information about making distinct queries run faster at the DBMS level, if possible).

How can I make DISTINCT run faster (on PostgreSQL 9, specifically) without altering the SQL (i.e., I can't alter the incoming SQL, but I do have access to optimize something at the DB level)?

orokusaki
  • The full query would be needed to give an answer. Usually, one uses distinct for bad reasons. – Denis de Bernardy Jul 06 '11 at 15:33
  • @orokusaki: A query is a whole thing. It is not a distinct on one side, a join on another, and a group by on the third. It is not because your query is faster without the distinct that the distinct is the problem... You can't answer a question like that. Or yes you can: use an index. – Cyril Gandon Jul 06 '11 at 15:40
  • @orokusaki: it really depends. Frequently, the use of distinct in a query reflects a sub-optimal join somewhere. Not always, but frequently enough. In such cases, the idea is to rewrite the query so that the sub-statement is in a sub-query that returns unique rows (or is checked using the in() clause). – Denis de Bernardy Jul 06 '11 at 16:09
  • @Denis - there is a sub-optimal join, with regards to performance, but it's impossible to avoid. I'm filtering on a M2M relationship (get all users that have [x, y, or z] in their list of `foos`). – orokusaki Jul 06 '11 at 17:36
  • I am not sure why people are giving you a hard time over making an assumption of the distinct being the slow portion. It's relatively easy to isolate query speed down to something like that. – Bagelstein Feb 27 '18 at 15:11
  • @Bagelstein thanks - I know, that's always been a frustrating thing about SO; you have to come here kneeling and feign ignorance in order to get anyone interested to help you. If you come with any prior knowledge, you are treated in much the same way that you would be treated if you went to the doctor and said "I think I might have a problem with X"; egos are immediately on guard and you're regarded as an idiot. – orokusaki Feb 27 '18 at 15:47

3 Answers

Oftentimes, you can make such queries run faster by replacing the distinct with a group by instead:

select my_table.foo
from my_table
where [whatever conditions you want]
group by foo;

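Whether the rewrite actually helps depends on which plan the server picks for each form; a hedged sketch of how one might compare the two, using the table and column from the question (the omitted WHERE clause would go in both):

```sql
-- Compare the plans the server chooses for each form. Historically,
-- GROUP BY could use a hash aggregate in cases where DISTINCT fell
-- back to a sort, which is one reason this rewrite often helped.
EXPLAIN ANALYZE SELECT DISTINCT "my_table"."foo" FROM "my_table";
EXPLAIN ANALYZE SELECT "my_table"."foo" FROM "my_table" GROUP BY "foo";
```

If both forms produce the same plan (e.g. both hash-aggregate), the rewrite will make no difference on that server version.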
rogerdpack
  • I can't modify the SQL, which is why I left most of the query out. – orokusaki Jul 06 '11 at 15:33
  • Well, get the users (either the developers writing the app that's running these queries or the users running these queries ad hoc) to switch their SQL. If you can't do that, then you might be able to get some mileage out of indexing `my_table` on `foo`. – Jul 06 '11 at 15:36
  • This was a great solution for me. I initially thought it was the large offset making my queries run slow, but after switching from DISTINCT to GROUP BY, they ran 20 times faster. Thanks! – xaisoft Aug 08 '13 at 18:38
  • Thanks for the great tip Jack! Swapping SELECT DISTINCT for GROUP BY reduced the runtime of my particular query from 649ms down to 87ms, almost 7.5x faster. My INNER JOIN is between a table containing ~30,000 rows (from which I wanted the matching rows) and another join table containing ~322,000 rows (which I was using to filter the first). I had already added indexes to all the columns used on both sides of my INNER JOIN and single WHERE clause, so I was hoping to find another optimization that would help speed up the query, and this worked very well in this particular case. – bluebinary Oct 30 '13 at 19:34
  • Question is... Why does this make the query faster? – Ioan Alexandru Cucu Nov 15 '17 at 18:15
  • @IoanAlexandruCucu [This answer](https://stackoverflow.com/a/45833583/974555) may contain some hints on that. – gerrit Oct 16 '19 at 09:11

Your DISTINCT is causing it to sort the output rows in order to find duplicates. If you put an index on the column(s) selected by the query, the database may be able to read them out in index order and save the sort step. A lot will depend on the details of the query and the tables involved; saying you "know the problem is with the DISTINCT" really limits the scope of available answers.
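A minimal sketch of such an index, assuming the column from the question (the index name is hypothetical, and whether the planner uses it will depend on the omitted WHERE clause):

```sql
-- Hypothetical index on the DISTINCT column; with it in place, the
-- planner may read values in index order instead of sorting them.
CREATE INDEX my_table_foo_idx ON my_table (foo);
ANALYZE my_table;  -- refresh statistics so the planner considers it
```

This is a DB-level change that requires no modification to the incoming SQL, which matches the constraint in the question.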

antlersoft
  • I know it limits the scope of answers, which is why I did it. I'm looking only for answers that happen at the DB level, which is what you gave me (+1). – orokusaki Jul 06 '11 at 15:32
  • An index alone isn't enough. I have an index on my distinct column, and yet the query still takes several minutes to search 8 million rows to find 4 distinct values. – Cerin Jan 27 '14 at 14:50
  • See http://stackoverflow.com/a/14732410/32453 - putting the select distinct query in a subquery and counting that worked for me, bizarrely. – rogerdpack Nov 17 '14 at 22:31

You can try increasing the work_mem setting. Depending on the size of your dataset, it can cause the query plan to switch to hash aggregates, which are usually faster.

But before setting it too high globally, first read up on it. You can easily blow up your server, because the max_connections setting acts as a multiplier to this number.

This means that if you were to set work_mem = 128MB with max_connections = 100 (the default), you should have more than 12.8GB of RAM. You're essentially telling the server that it can use that much for performing queries (not even considering any other memory use by Postgres or otherwise).
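A hedged sketch of trying this for a single session before touching the global setting (the value is illustrative, not a recommendation):

```sql
-- Raise work_mem for this session only; each sort or hash node in an
-- active query may use up to this much memory.
SET work_mem = '128MB';
EXPLAIN ANALYZE SELECT DISTINCT "my_table"."foo" FROM "my_table";
RESET work_mem;  -- back to the server default for this session
```

If the plan switches from a sort to a HashAggregate at the higher setting, that suggests a more modest global (or per-role) increase may be worth testing.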

maniek