
I have a lot of data which needs to be paired based on a few simple criteria. There is a time window (both records have a DateTime column): if one record is very close in time (within 5 seconds) to another, it is a potential match, and the record that is closest in time is considered the complete match. There are other fields which help narrow this down as well.
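To illustrate, the "potential match" rule boils down to a join predicate like the one below (EventID and EventTime are placeholders, not necessarily my real column names, and the other narrowing fields are left out):

-- Placeholder names: EventID / EventTime; extra narrowing fields omitted.
SELECT f.EventID, h.EventID
FROM FireEvent f
INNER JOIN HitEvent h
    ON ABS(DATEDIFF(SECOND, f.EventTime, h.EventTime)) <= 5;  -- within 5 seconds = potential match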

I wrote a stored procedure which does this matching on the server before returning the full, matched dataset to a C# application. My question is: would it be better to pull in the 1 million (x2) rows and deal with them in C#, or is SQL Server better suited to perform this matching? If SQL Server is, then what is the fastest way of pairing data using DateTime fields?

Right now I select all records from Table 1/Table 2 into temporary tables, iterate through each record in Table 1, look for a match in Table 2, store the match (if one exists) in a temporary table, and then delete both records from their respective temporary tables.

I had to rush this piece for a game I'm writing, so excuse the bad (very bad) procedure... It works, it's just horribly inefficient! The whole SP is available on pastebin: http://pastebin.com/qaieDsW7

I know the SP is written poorly, so saying "hey, dumbass... write it better" doesn't help! I'm looking for help in improving it, or help/advice on how I should do the whole thing differently! I have about 3-5 days to rewrite it; I can push that deadline back a bit, but I'd rather not if you guys can help me in time! :)

Thanks!

Faraday
  • You absolutely need to experiment. Preferably on a smaller, more tractable test set. But my *guess* is that, if you do it right, SQL will be the clear winner... IMHO... – paulsm4 May 29 '12 at 21:29
  • I have done something similar in the past, where I had to match records based on some sort of sliding window (albeit I had to work across different data sources with unsynchronized time sources [off by several minutes]). Anyway, the easiest way I found to do this is to first insert all your data into a temp table in ascending time order (like you're doing), and then make modifications to the table by selecting from it and joining it back to itself to determine the update values. If you have questions about this technique, hit me up in [SO C# Chat](http://chat.stackoverflow.com/rooms/7/c) – SPFiredrake May 29 '12 at 21:33
  • Maybe you could create a table-valued function to do the matching and use a `CROSS APPLY` with the first table? In any case I'd imagine doing this on the server will be much more efficient. – Lee May 29 '12 at 21:34

4 Answers


If the data is already in the database, that is where you should do the work. You should absolutely learn to display and read query plans using SQL Server Management Studio, and become able to notice and optimize away expensive operations like nested loops.
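For instance, you can turn on statistics output around a call to the procedure and read the numbers it prints alongside the graphical plan ("Include Actual Execution Plan" in Management Studio). The procedure name below is only a placeholder:

SET STATISTICS IO ON;    -- reports logical/physical reads per table
SET STATISTICS TIME ON;  -- reports CPU and elapsed time per statement

EXEC dbo.MatchEvents;    -- placeholder: run your actual stored procedure here

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;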

Your task probably does not require any use of temporary tables. Temporary tables tend to be efficient when they are relatively small and/or heavily reused, which is not your case.

Jirka Hanika
  • OK, so how would you go about finding the bottlenecks in this SP? I've heard of query plans, but I've not had to use them before as my SPs have always been VERY simple! Can you give me an idiot's guide? :) – Faraday May 29 '12 at 21:57
  • Drop everything and Google for "sql server show plan" or "query performance"!!!!! Now. Please - it's important! IMHO... – paulsm4 May 29 '12 at 22:43

Ultimately, compiling your data on the database side is preferable 99% of the time, as the database is designed for data crunching (through the use of indexes, relations, etc.). A lot of your code can be consolidated by using joins to compile the data in exactly the format you need. In fact, you can bypass almost all your temp tables entirely and just fill a master #Events temp table.

The general pattern is this:

INSERT INTO #Events
SELECT <all interested columns>
FROM 
    FireEvent
    LEFT OUTER JOIN HitEvent ON <all join conditions for HitEvent>

This way you match all fire events to zero or more HitEvents. After our discussion in chat, you can even limit it to zero or one hit event by wrapping the query in a subquery, using a window function such as `ROW_NUMBER() OVER (PARTITION BY HitEvent.EventID ORDER BY ...) AS HitRank`, and adding `WHERE HitRank = 1` to the outer query. This is ultimately what you ended up doing, and you got the results you were expecting (with a bit of work and learning in the process).
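A sketch of that final shape, assuming the key columns are called EventID and the timestamps EventTime (both assumptions; adjust the names and add your other narrowing predicates). Depending on which side you want de-duplicated you can partition by either key; here each fire event keeps only its closest hit:

-- Sketch only: column names are assumptions, and the extra narrowing fields are omitted.
INSERT INTO #Events (FireEventID, HitEventID)
SELECT FireEventID, HitEventID
FROM (
    SELECT
        f.EventID AS FireEventID,
        h.EventID AS HitEventID,
        ROW_NUMBER() OVER (
            PARTITION BY f.EventID                                        -- keep one row per fire event
            ORDER BY ABS(DATEDIFF(MILLISECOND, f.EventTime, h.EventTime)) -- closest hit first
        ) AS HitRank
    FROM FireEvent f
    LEFT OUTER JOIN HitEvent h
        ON h.EventTime BETWEEN DATEADD(SECOND, -5, f.EventTime)
                           AND DATEADD(SECOND,  5, f.EventTime)           -- the 5 second window
) AS Ranked
WHERE HitRank = 1;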

SPFiredrake

I would advise you to try to optimize the stored procedure if it is not running fast enough, rather than rewrite it in C#. Why would you want to transfer millions of rows out of SQL Server anyway?

Unfortunately I don't have an SQL Server installation at hand so I can't test your script, but I don't see any CREATE INDEX statements in there. Unless you just skipped them for brevity, you should analyze your queries and see which indexes are needed.
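For example, if the join is driven by the DateTime columns, indexes along these lines (table and column names are guesses based on the question) would let the optimizer seek rather than scan:

-- Names are illustrative; index whatever columns actually appear in the join/WHERE clauses.
CREATE INDEX IX_FireEvent_EventTime ON FireEvent (EventTime);
CREATE INDEX IX_HitEvent_EventTime  ON HitEvent (EventTime);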

mavroprovato

The answer depends on several factors, such as the resources available on the client and server (RAM, CPU, concurrent users, concurrent processes, etc.).

Here are some basic rules that will improve your performance regardless of what you use:

  • Loading a million rows into a C# program is not good practice, unless it is a stand-alone process with plenty of RAM.
  • Uniqueidentifiers will never outperform integers. Comparisons
  • Common Table Expressions are a good alternative for fast matching (see the sketch after this list). How to use CTE
  • Finally, you have to consider the output. If there is constant reading and writing that affects the user interface, then you should manage that in memory (C#); otherwise, all CRUD operations should be kept inside the database.
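To illustrate the CTE point above, the same closest-match logic from the other answers can be phrased as a CTE (a sketch only; table and column names are assumptions):

-- Sketch only: FireEvent/HitEvent, EventID and EventTime are placeholder names.
WITH Candidates AS
(
    SELECT
        f.EventID AS FireEventID,
        h.EventID AS HitEventID,
        ROW_NUMBER() OVER (
            PARTITION BY f.EventID
            ORDER BY ABS(DATEDIFF(MILLISECOND, f.EventTime, h.EventTime))
        ) AS rn
    FROM FireEvent f
    INNER JOIN HitEvent h
        ON ABS(DATEDIFF(SECOND, f.EventTime, h.EventTime)) <= 5
)
SELECT FireEventID, HitEventID
FROM Candidates
WHERE rn = 1;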
Internet Engineer