TOP slows down query

Question

I have a pivot query on a table with millions of rows. Running the query normally, it runs in 2 seconds and returns 2983 rows. If I add TOP 1000 to the query it takes 10 seconds to run.

What could be causing this?

SELECT * 
  FROM (SELECT l.PatientID,
               l.LabID,
               l.Result
          FROM dbo.Labs l
          JOIN (SELECT MAX(LabDate) maxDate, 
                       PatientID, 
                       LabID 
                  FROM dbo.Labs 
              GROUP BY PatientID, LabID) s ON l.PatientID = s.PatientID
                                          AND l.LabID = s.LabID
                                          AND l.LabDate = s.maxDate) A
 PIVOT(MIN(A.Result) FOR A.LabID IN ([1],[2],[3],[4],[5],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17])) p

Execution plans:

This alternate formulation has the same problem:

select
    * 
FROM (
    SELECT 
        l.PatientID,
        l.LabID,
        l.Result
    FROM dbo.Labs l
    where l.LabDate = (
        select 
            MAX(LabDate) 
        from Labs l2 
        where l2.PatientID = l.PatientID 
            and l2.LabID = l.LabID
    )
) A
PIVOT(MIN(A.Result) FOR A.LabID IN ([1],[2],[3],[4],[5],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17])) p

Have you looked at the query plans for each version of the query? — Tom H, Jul 13 '11 at 15:58
The TOP version has two clustered index scans, the regular version has one. I have no idea what this means. — τεκ, Jul 13 '11 at 16:00

Quassnoi · Accepted Answer · 2011-07-13T20:11:07.170

4

SELECT  TOP 1000
        *
FROM    (
        SELECT  patientId, labId, result,
                DENSE_RANK() OVER (PARTITION BY patientId, labId ORDER BY labDate DESC) dr
        FROM    labs
        ) q
PIVOT   (
        MIN(result)
        FOR
        labId IN ([1],[2],[3],[4],[5],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17])
        ) p
WHERE   dr = 1
ORDER BY
        patientId

You may also try creating an indexed view like this:

CREATE VIEW
        v_labs_patient_lab
WITH SCHEMABINDING
AS
SELECT  patientId, labId, COUNT_BIG(*) AS cnt
FROM    dbo.labs
GROUP BY
        patientId, labId

CREATE UNIQUE CLUSTERED INDEX
        ux_labs_patient_lab
ON      v_labs_patient_lab (patientId, labId)

and use it in the query:

SELECT  TOP 1000
        *
FROM    (
        SELECT  lr.patientId, lr.labId, lr.result
        FROM    v_labs_patient_lab vl
        CROSS APPLY
                (
                SELECT TOP 1 WITH TIES
                       result
                FROM   labs l
                WHERE  l.patientId = vl.patientId
                       AND l.labId = vl.labId
                ORDER BY
                       l.labDate DESC
                ) lr
        ) q
PIVOT   (
        MIN(result)
        FOR
        labId IN ([1],[2],[3],[4],[5],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17])
        ) p
ORDER BY
        patientId

edited Jul 13 '11 at 20:11

answered Jul 13 '11 at 16:10

Quassnoi

381,935
83
584
593

That does not retrieve the same result. It seems to split the records out by LabDate. http://i.imgur.com/8xlmn.png – τεκ Jul 13 '11 at 16:39
This works. It runs in 4 seconds with or without the TOP. I am still curious as to why TOP slows down the original query – τεκ Jul 13 '11 at 16:59
1

@τεκ: Possibly SQL Server is not consistenly good at figuring out an optimal query plan for queries that use the combination of `TOP+ORDER BY`, `JOIN`, and `PIVOT`. – Andriy M Jul 13 '11 at 17:25
Is there a way to suggest a better query plan? I prefer my original method to ranking functions if viable. – τεκ Jul 13 '11 at 18:15
1

@τεκ: your first plan uses `segment` / `top` (as you can see there is no join at all in the plan). This requires scanning the whole table. The second plan contains the actual join but since `patientId, labId, labDate` are (most probably) the leading part of the `PRIMARY KEY` (I'm assuming this because I see no sort before stream aggregate and merge join) this is more scalable because it has zero response time and hence is linear to the number of records in `TOP`. – Quassnoi Jul 13 '11 at 20:03
The indexed view/cross apply method works perfectly, runs as fast as the hash join for retrieving the whole table, and faster when retrieving less. – τεκ Jul 14 '11 at 14:35

score 2 · Answer 2 · edited Jul 13 '11 at 17:19

2

There is a specific order in which queries are processed.

A normal SQL query will be written as follows:

SELECT [...]
  FROM [table1]
  JOIN [table2]
    ON [condition]
 WHERE [...]
 GROUP BY [...]
HAVING [...]
 ORDER BY [...]

But the processing order is different:

FROM [table1]
    ON [condition]
  JOIN [table2]
 WHERE [...]
 GROUP BY [...]
HAVING [...]
SELECT [...]
 ORDER BY [...]

When using SELECT DISTINCT [...] or SELECT TOP [...] the processing order will be as follows:

FROM [table1]
    ON [condition]
  JOIN [table2]
 WHERE [...]
 GROUP BY [...]
HAVING [...]
SELECT [...] DISTINCT[...]
ORDER BY [...]
TOP [....]

Hence it's taking longer as your SELECT TOP 1000 is processed last.

Take a look at this link for further details: http://blogs.msdn.com/b/sqlqueryprocessing/

edited Jul 13 '11 at 17:19

Andriy M

71,352
17
87
142

answered Jul 13 '11 at 16:12

bombus1700

98
1
8

Five times as long? If TOP 1000 is processed last, wouldn't that mean it would just take the first 1000 results after retrieving all of them? I doubt it does that. – τεκ Jul 13 '11 at 16:29
@τεκ to order them properly it will have to retrieve all of them. – StuperUser Aug 17 '11 at 16:31

τεκ · Answer 3 · 2011-07-13T19:41:50.420

After doing some googling about suggesting an execution plan, I found the solution.

SELECT TOP 1000 * 
FROM (SELECT l.PatientID,
               l.LabID,
               l.Result
          FROM dbo.Labs l
          JOIN (SELECT MAX(LabDate) maxDate, 
                       PatientID, 
                       LabID 
                  FROM dbo.Labs 
              GROUP BY PatientID, LabID) s ON l.PatientID = s.PatientID
                                          AND l.LabID = s.LabID
                                          AND l.LabDate = s.maxDate) A
PIVOT(MIN(A.Result) FOR A.LabID IN ([1],[2],[3],[4],[5],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17])) p
OPTION (HASH JOIN)

OPTION (HASH JOIN) being the thing. The resulting execution plan for the version with TOP looks like the original non-top one, with a TOP tacked on at the end.

Since I was originally doing this in a view what I actually ended up doing was changing JOIN to INNER HASH JOIN

TOP slows down query

3 Answers3

Linked