4

In Oracle 12, we finally have a limit feature, so we can simply

select distinct columnname from mytable fetch first n rows only;

However, I am presently stuck with a previous version and would like to know how to achieve this result.

Ideally the query should return rows immediately, even for a huge table. It should return results as soon as the N distinct have been found, instead of processing all rows.

Jon Heller
  • 31,663
  • 6
  • 63
  • 114
Apollys supports Monica
  • 1,979
  • 1
  • 15
  • 28
  • What do you mean by "first" in "first n rows"? Doesn't that require some ordering of rows? There is no such ordering in your example. – mathguy Apr 20 '17 at 23:51
  • No, it doesn't require an ordering. Just like `select * from mytable where rownum <= n` doesn't require an ordering. You can read the documentation and see this. – Apollys supports Monica Apr 20 '17 at 23:52
  • That will get you **some** (random) n rows. What do **you** mean by *first*? – mathguy Apr 21 '17 at 00:12
  • Why are you arguing with me? It's already defined, the function already exists. Go read the spec man. – Apollys supports Monica Apr 21 '17 at 00:21
  • The duplicate is not using a sledgehammer to crack a nut, it contains the methods you need you achieve the result you want. It'll be fine with 5 rows or 500m. – Ben Apr 21 '17 at 07:40
  • @Ben Ummm no, it wont. I just ran it and it's taking forever to select 10 rows. Learn what you're talking about before spamming my question with duplicate votes. – Apollys supports Monica Apr 21 '17 at 16:16
  • I would like to help; however, you need to help us help you. If the two standard methods of achieving what you want, on the system you're using, don't work then some additional contextual information is really useful. You've asked how to achieve the result and the answer is there. If that answer isn't good enough, for whatever reason, then by adding more details about your specific situation (for instance the DDL of the table, the current query plan, the cardinality of the column you're accessing etc.) will enable potential answerers to tailor their answer to your specific needs... contd – Ben Apr 21 '17 at 16:44
  • ... or to suggest a better duplicate. I've just run these queries on an 850m row table in 7 minutes, single-threaded - if it's taking 10 minutes for you then there must be something more that would help you get a better answer. If you're able to add this please [edit] your question with the code you're using and details about the environment I'd be happy to help. – Ben Apr 21 '17 at 16:44
  • 1
    It's probably not your question but the attitude you show in insulting the people that gets you the downvotes – a_horse_with_no_name Apr 21 '17 at 22:08
  • Yes, but isn't it great how they can insult and disrespect me to their heart's content... But the moment it's the other way around! BAM! Duplicate, downvote, bury, sayonara. – Apollys supports Monica Apr 21 '17 at 22:14
  • Nobody insulted you. _You_ are insulting everyone else. – a_horse_with_no_name Apr 21 '17 at 22:15
  • 1
    Not taking the few seconds to properly read the question and just spamming duplicate votes is very insulting. – Apollys supports Monica Apr 21 '17 at 22:16
  • So it turns out this isn't a "how to do it question", but a question on how to improve the performance of a specific query but the details necessary to answer it are missing. In order to answer that, please [edit] your question and add the `create table` statement (including all indexes) of the table in question and the execution plan. [**Formatted text**](http://stackoverflow.com/help/formatting) please, [no screen shots](http://meta.stackoverflow.com/questions/285551/why-may-i-not-upload-images-of-code-on-so-when-asking-a-question/285557#285557) – a_horse_with_no_name Apr 21 '17 at 22:17
  • Not really. The question is 100% clear, I simply want to reproduce the behavior of the query given in the second line of the post. – Apollys supports Monica Apr 21 '17 at 22:19
  • Well you _are_ whining about performance: "*It has been running for 10 minutes without finishing..*" and "*Yet it takes your method 7 minutes to return 10 row*" - but you never tell us what "your solution" is. So you apparently have a solution, but it's too slow. But I'm out of here, you apparently only want to insult people and are unwilling to give more information that is needed to solve the performance problem. – a_horse_with_no_name Apr 21 '17 at 22:30
  • 1
    If you can't understand "reproduce the behavior of X given line of code", I'm not sure what you're doing on this site. – Apollys supports Monica Apr 21 '17 at 23:00
  • @Apollys This site gets a question *almost* the same as this about once a day and it gets really tiring. 99% of the questions on here about "Top N" or "first rows" are a duplicate. It looks like you've got an interesting variation of that common problem but we have to try extra hard to separate it. (And I apologize in advance if I misunderstood your question and incorrectly edited the post - it's your question so feel free to change it back if you want.) – Jon Heller Apr 22 '17 at 04:37

2 Answers2

4

I think using

.. where rownum < XXX

should work

Perhaps

select * from (select distinct columnname from mytable) where  rownum < XXX
Ed Heal
  • 55,822
  • 16
  • 77
  • 115
4

No version of Oracle has a native syntax that returns a distinct Top N in an optimal way. This feature must be manually created using a PL/SQL pipelined function.

Sample Schema

This script creates a table with one column, about 100 million rows, that uses roughly 1GB of space.

--drop table mytable purge;

create table mytable(columnname number not null) nologging;

insert /*+ append */ into mytable
select level from dual connect by level <= 100000;
commit;

begin
    for i in 1 .. 10 loop
        insert /*+ append */ into mytable select * from mytable;
        commit;
    end loop;
end;
/

begin
    dbms_stats.gather_table_stats(user, 'MYTABLE');
end;
/

--1.25GB.
select bytes/1024/1024/1024 gb from dba_segments where segment_name = 'MYTABLE';

Oracle 12c row limiting clause does not work well with distinct.

The new 12c syntax consistently runs in about 20 seconds to return a small amount of rows:

select distinct columnname from mytable fetch first 10 rows only;

The statement reads the entire table, hashes the entire table, and then grabs the top N rows:

explain plan for
select distinct columnname from mytable fetch first 10 rows only;

select * from table(dbms_xplan.display(format => 'basic'));

Plan hash value: 239985407

------------------------------------------
| Id  | Operation              | Name    |
------------------------------------------
|   0 | SELECT STATEMENT       |         |
|   1 |  VIEW                  |         |
|   2 |   WINDOW NOSORT STOPKEY|         |
|   3 |    VIEW                |         |
|   4 |     HASH UNIQUE        |         |
|   5 |      TABLE ACCESS FULL | MYTABLE |
------------------------------------------

The Oracle 11g version created by Ed Heal below surprisingly works better! It runs in about 12 seconds.

select * from (select distinct columnname from mytable) where  rownum < 10;

Even though it's much faster, 12 seconds is still horrible. Regardless of my CPU or I/O performance, the algorithms must be wrong if it's taking seconds instead of milliseconds.

Indeed, this plan looks a little better. It has a SORT GROUP BY STOPKEY low in the plan. This stops the query before it processes everything. But it's still stopping much too late. (Maybe Oracle is still reading the entire table but only sorting part of it?)

explain plan for
select * from (select distinct columnname from mytable) where  rownum < 10;

select * from table(dbms_xplan.display(format => 'basic'));

Plan hash value: 3842480186

-------------------------------------------
| Id  | Operation               | Name    |
-------------------------------------------
|   0 | SELECT STATEMENT        |         |
|   1 |  COUNT STOPKEY          |         |
|   2 |   VIEW                  |         |
|   3 |    SORT GROUP BY STOPKEY|         |
|   4 |     TABLE ACCESS FULL   | MYTABLE |
-------------------------------------------

Pipelined Functions

This is an ugly solution for several reasons. It requires new code and objects for different result sets. It may not scale well - the function has a collection to store previous results, what happens if that collection gets huge?

Every different result type requires new objects:

--Create an object to hold a record with the result columns.
--(Not necessary for this simple example since there's only one column, but will
-- be necessary if there are multiple columns.)
create or replace type columnname_rec is object
(
    columnname number
);

--Create an object to hold a table of the records.
create or replace type columnname_tab is table of columnname_rec;

And a different function that returns a different type:

--Function that returns the distinct Top N as soon as they are found.
create or replace function fast_distinct_top_n(p_n number, p_cursor in sys_refcursor) return columnname_tab pipelined is
    v_columnname number;
    v_distinct_count number := 0;

    type previous_values_type is table of varchar2(4000) index by varchar2(4000);
    v_previous_values previous_values_type;
begin
    loop
        --Get new value.
        fetch p_cursor into v_columnname;

        --If the new value does not exist...
        if not v_previous_values.exists(v_columnname) then
            --Save the new value.
            v_previous_values(v_columnname) := v_columnname;
            --Increment the counter.
            v_distinct_count := v_distinct_count + 1;
            --Return the value
            pipe row(columnname_rec(v_columnname));
            --Exit if the counter is more than the top N.
            exit when v_distinct_count >= p_n;
        end if;
    end loop;
end;
/

But finally we have created a query that returns distinct Top N results in milliseconds.

select * from table(fast_distinct_top_n(10, cursor(select * from mytable)));

If you can't stand creating so many objects there's probably a way to make it generic using Method4. But that solution would still be complicated.

Jon Heller
  • 31,663
  • 6
  • 63
  • 114