4

I have a SQL Server 2008 database with approximately 14 million rows. In it there are two tables:

Table1
    rowId int, primary key
    someData1 int
    someData2 int...

Table2
    id int, primary key
    rowId int, refers to rowId in Table1
    someCalculatedData int...

    Table2.rowId is not a foreign key, but I did make a non-clustered, non-unique index on it

When I import the data I insert everything into Table1, then find the rows that don't have an entry in Table2 and insert data for them into Table2.

I was doing this in batches: select 250,000 rows at a time, process the data, insert them into Table2, then find the next 250,000 rows, and so on until there were no rows in Table1 without an entry in Table2.

select TOP 250000 rowId from Table1 
       where NOT EXISTS (select rowId from Table2 where Table1.rowId=Table2.rowId)

This query has really slowed down and now it takes over two minutes to get the next batch. If I run the query without the TOP or SET ROWCOUNT keywords and get all of the rows, then the query returns the results in ~15 seconds.

Does anyone know why TOP and SET ROWCOUNT cause the query to take so much longer than getting all of the data?

Can I improve the performance of the query and still only get a subset of the data each time?

nick
  • Does the query return ALL the rows in ~15 seconds, or does it just *start* returning rows, with the first one arriving after 15 seconds? – Remus Rusanu Nov 18 '09 at 23:11
  • If I run the query in SQL Server Management Studio I can see some results after 5-8 seconds and all of them by 15 seconds without the TOP keyword. If I include the TOP keyword I don't see anything for 2+ minutes. – nick Nov 18 '09 at 23:15
  • It can't possibly return 14 mil rows in 15 seconds; it means there must be only very few left that satisfy the insert criteria. Why not just insert them, w/o the TOP clause? – Remus Rusanu Nov 18 '09 at 23:20
  • Right now I have ~1.8 million rows that are in Table1 but not Table2. If I run the query locally/on the same box in SSMS I get 1.8 million rows (with only the rowId column) back in 15 seconds. – nick Nov 18 '09 at 23:27
  • I'm not sure why the TOP introduces such a discrepancy. Obviously a change of plan occurs, and I suspect that either a merge join is replaced by a nested loop join and/or the MAXDOP of the query changes. Is it possible to post the two query plans, the slow and the fast one? – Remus Rusanu Nov 18 '09 at 23:46
  • I just rebuilt/reorganized the index on Table2.rowId from the wizard (so I don't know exactly what I did), and now the TOP query is more performant than selecting all of the data. The avg_fragmentation_in_percent value for this index was only 0.67, which according to Books Online is less than what I should have worried about. I guess I need to look into how the indexes maintain themselves over time. – nick Nov 18 '09 at 23:56
  • Most likely the update on statistics did the trick. Outdated statistics would cause the optimizer to choose the wrong leg of a join, believing it will have fewer rows to scan. If you do frequent updates to the table and auto-update statistics is OFF, it is likely the statistics got out of whack. – Remus Rusanu Nov 18 '09 at 23:59

3 Answers

3

See if this helps:

select top 250000 t1.rowId
from Table1 t1
left outer join Table2 t2 on t1.rowId = t2.rowId
where t2.rowId is null
C B
  • This is a good suggestion. "Not exists", especially in a subquery, can be slow. A "left join where null" will give you the same thing. – DanO Nov 18 '09 at 23:35
  • Exactly what I'd recommend! Depending on what indexes you have set, this switch can make quite a bit of a difference. – KSimons Nov 18 '09 at 23:42
1

This occurs because the whole query has to run on the server before the server can decide which rows are the "top" 250,000. Only then will the data start coming back to your client over the network.

Without the "top" statement the server will start sending data immediately, though I suspect the total time for all the data to be sent from the server and received by your machine is likely to be similar to the statement including the top.

Sonny Boy
  • Unfortunately not. Whether it's the server working through the 14 million rows or your client machine doing it, the work still has to be done. – Sonny Boy Nov 18 '09 at 23:25
  • The query seems acceptably fast both locally and remotely if I select all the rowIDs in Table1 but not in Table2. If I try to get any subset of them then it takes a lot longer. – nick Nov 18 '09 at 23:29
1

Apparently my index and/or the index statistics on Table2.rowId weren't fresh, which caused the query plan for the TOP/SET ROWCOUNT queries to perform poorly.

I reorganized/rebuilt the index and the query performance for the TOP/SET ROWCOUNT queries was greatly improved.

nick
  • If anyone can expound on the nuances of what was going on and/or how to diagnose/more quickly identify these issues, please share :) – nick Nov 19 '09 at 00:10