
We have a C# application which parses data from text files. We then have to update records in our SQL database based on the information in those files. What is the most efficient way of passing the data from the application to SQL Server?

We currently use a delimited string and then loop through it in a stored procedure to update the records. I am also testing a TVP (table-valued parameter). Are there any other options out there?

Our files contain thousands of records and we would like a solution that takes the least amount of time.
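
For reference, this is roughly the shape of the TVP call I am testing; the table type and stored procedure names below are placeholders, not our actual objects:

```csharp
using System.Data;
using System.Data.SqlClient;

internal static class TvpTest
{
    // The table type dbo.RecordUpdateType (RecordId INT, NewValue NVARCHAR(100))
    // and the procedure dbo.usp_UpdateRecords (@Records dbo.RecordUpdateType READONLY)
    // are placeholder names, not our real objects.
    public static void SendUpdates(string connectionString, DataTable parsedRows)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.usp_UpdateRecords", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;

            // Pass all parsed rows as a single table-valued parameter.
            SqlParameter tvp = cmd.Parameters.AddWithValue("@Records", parsedRows);
            tvp.SqlDbType = SqlDbType.Structured;
            tvp.TypeName = "dbo.RecordUpdateType";

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```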

Anita
  • Another option is BULK INSERT the file into a staging table in SQL Server (using `BCP.EXE` or `SSIS` or the `BULK INSERT` keyword), then transform it from your staging table to the final table using a stored procedure. In comparison to C#, SQL is efficient at ordering, lookups, grouping, and summarising, but not so good at text manipulation. What format are your text files and what kind of processing is required? – Nick.McDermaid Feb 04 '15 at 21:55
  • Thousands of records are peanuts for SQL Server. You can do this any way you like, repeat it 10x, and it would still be fast – Andomar Feb 04 '15 at 21:55
  • TVP will be more efficient than looping through a string. In my testing, it generally beats all of the other string splitting mechanisms, even CLR. However, if you have an index on the string column, another approach is to pass the variable as string-and-comma-delimited, e.g. `'Smith','Jones','Frankenstein'`, and then assemble that in dynamic SQL (pretend there are a bunch of disclaimers here about SQL injection, and read up on that later). That said, identifying rows to update based on strings (especially if you're talking about multi-valued strings like JSON) sounds like a bad idea. – Aaron Bertrand Feb 04 '15 at 21:56
  • The mistake most people make is to try to send strings only comma-delimited, e.g. `Smith,Jones,Frankenstein`, and then give that to an `IN(@parameter)` or construct dynamic SQL - that won't work, because `IN()` treats the parameter as a single value, and in dynamic SQL the unquoted names look like a set of column names. – Aaron Bertrand Feb 04 '15 at 21:57
  • The TVP is probably more efficient for your C# code, too, after all you are probably building a comma-separated string in some inefficient way but ultimately based off something with better structure, like a DataTable, right? – Aaron Bertrand Feb 04 '15 at 21:58
  • Thanks @AaronBertrand. I am using a data table in my application which is much easier to pass to the SP. Should I put a limit on how many records I should be passing to SQL Server at a time? – Anita Feb 04 '15 at 22:01
  • I don't think there is a magic number - it will all depend on a variety of factors. By default, SQL Server will usually escalate to a table lock if you touch > 5000 rows, but it could make that decision at other thresholds again depending on a lot of factors. Table lock will cause blocking and limit concurrency. Stepping back, though, I just can't envision the scenario where you need to update thousands of rows at once, and the only way you can do that is to identify each of them individually. Can you explain the real-world scenario even if a little abstractly? How did they get to a DataTable? – Aaron Bertrand Feb 04 '15 at 22:08
  • As I mentioned, we get text files in different formats from different clients. The information in these text files is not consistent (different clients send us different text files). We need specific data elements from these text files in order to update our database, so we have different parsers for different text files. We basically read one line at a time and extract the data we need. When using TVP (which I am still testing), I just add the data elements to my data table and then call the SP when I am done parsing. The text files are HUGE and could contain more than 50K records. – Anita Feb 04 '15 at 22:16
  • You could write stored procedures in your database that take 1 input parameter, which is XML, then send the whole XML file to the stored proc (size permitting). – Ryan Mann Feb 04 '15 at 22:44
  • In a similar scenario I used the datatable as the output of the parser and then a simple SqlCommand with named parameters (to prevent injection) and sent about 200 rows in each command, iterating through the datatable. 10K rows are updated within a minute (from file upload, so including parsing and encryption of some fields in each of these 10k rows). – zaitsman Feb 04 '15 at 22:49
  • Do your files contain thousands of records or > 50K records? You should probably update your original question – Nick.McDermaid Feb 05 '15 at 03:43
  • I suggest that you try making one update at a time to SQL Server but parallelize the C# code (Parallel.ForEach, for example). My experience is that it's much more scalable to do lots of small calls than one huge call, and large TVPs are really slow. – Daniel Stackenland Feb 06 '15 at 07:19
  • I am limiting my TVP to 200 records at a time. I am seeing a huge improvement in performance compared to our existing code which uses a delimited string. As @zaitsman mentioned, 200 rows is a good size. – Anita Feb 06 '15 at 19:10

2 Answers


Please do not use a DataTable as that is just wasting CPU and memory for no benefit (other than possibly familiarity). I have detailed a very fast and flexible approach in my answer to the following question, which is very similar to this one:

How can I insert 10 million records in the shortest time possible?

The example shown in that answer is for INSERT only, but it can easily be adapted to include UPDATE. It also uploads all rows in a single shot, but that, too, can easily be adapted: set a counter for X number of records, exit the IEnumerable method after that many records have been passed in, and then close the file once there are no more records. This requires storing the file pointer (i.e. the stream) in a static variable that keeps getting passed to the IEnumerable method, so that it can be advanced and picked up at the most recent position the next time around. I have a working example of this technique in the following answer; it uses a SqlDataReader as input, but the approach is the same and requires very little modification:

How to split one big table that has 100 million data to multiple tables?

And for some perspective, 50k records is not even close to "huge". I have been uploading / merging / syncing data using the method shown here on files of 4 million rows, hitting several tables with 10 million (or more) rows.
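
To give a rough idea of the shape of that approach, here is a minimal sketch of streaming rows into a TVP via a method that returns IEnumerable<SqlDataRecord>. The table type, procedure, and column names are purely illustrative, and the line parsing is just a stand-in for whatever each client format requires:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;   // SqlDataRecord, SqlMetaData

internal static class TvpStreamingSample
{
    // Yields one SqlDataRecord per parsed line; nothing is buffered in memory.
    // dbo.RecordUpdateType (RecordId INT, NewValue NVARCHAR(100)) is a placeholder type name.
    private static IEnumerable<SqlDataRecord> StreamRecords(string filePath)
    {
        var record = new SqlDataRecord(
            new SqlMetaData("RecordId", SqlDbType.Int),
            new SqlMetaData("NewValue", SqlDbType.NVarChar, 100));

        foreach (string line in File.ReadLines(filePath))
        {
            // Parsing is client-specific; this split is only a stand-in.
            string[] parts = line.Split('\t');
            record.SetInt32(0, int.Parse(parts[0]));
            record.SetString(1, parts[1]);
            yield return record;   // the same record object is reused; its values are read per row
        }
    }

    // Passes the streamed rows to a TVP-based proc (dbo.usp_UpdateRecords is a placeholder name).
    public static void Upload(string connectionString, string filePath)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.usp_UpdateRecords", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;

            SqlParameter tvp = cmd.Parameters.Add("@Records", SqlDbType.Structured);
            tvp.TypeName = "dbo.RecordUpdateType";
            tvp.Value = StreamRecords(filePath);   // rows stream to SQL Server as they are read

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```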


Things not to do:

  • Use a DataTable: as I said, if you are just filling it for the purpose of using with a TVP, it is a waste of CPU, memory, and time.
  • Make 1 update at a time in parallel (as suggested in a comment on the question): this is just crazy. Relational database engines are heavily tuned to work most efficiently with sets, not singleton operations. There is no way that 50k inserts will be more efficient than even 500 inserts of 100 rows each. Doing it individually just guarantees more contention on the table, even if just row locks (that's 100k lock + unlock operations). It could be faster than a single 50k-row transaction that escalates to a table lock (as Aaron mentioned), but that is why you do it in smaller batches, just so long as small does not mean 1 row ;).
  • Set the batch size arbitrarily. Staying below 5000 rows is good to help reduce the chances of lock escalation, but don't just pick 200. Experiment with several batch sizes (100, 200, 500, 700, 1000) and try each one a few times; you will see what is best for your system. Just make sure that the batch size is configurable through the app.config file or some other means (a table in the DB, a registry setting, etc.) so that it can be changed without having to re-deploy code (see the sketch just after this list).
  • SSIS (powerful, but very bulky and not fun to debug)
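
As a minimal sketch of what a configurable batch size might look like (the app.config key name and the chunking helper are made up purely for illustration):

```csharp
using System.Collections.Generic;
using System.Configuration;   // add a reference to System.Configuration.dll

internal static class BatchSettings
{
    // Hypothetical app.config entry:
    //   <appSettings><add key="TvpBatchSize" value="500" /></appSettings>
    public static int GetBatchSize()
    {
        string raw = ConfigurationManager.AppSettings["TvpBatchSize"];
        int size;
        // Fall back to a default if the setting is missing or invalid.
        return int.TryParse(raw, out size) && size > 0 ? size : 500;
    }

    // Splits any row source into batches of the configured size so each call
    // to the stored procedure stays well below lock-escalation territory.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0)
        {
            yield return batch;
        }
    }
}
```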

Things which work, but not nearly as flexible as a properly done TVP (i.e. passing in a method that returns IEnumerable<SqlDataRecord>). These are ok, but why dump the records into a temp table just to have to parse them into the destination when you can do it all inline?

  • BCP / OPENROWSET(BULK...) / BULK INSERT
  • .NET's SqlBulkCopy
Solomon Rutzky
  • I do complex ETL processes and I love SSIS, but have to agree with you about not fun to debug. – HLGEM Feb 06 '15 at 21:38
  • @HLGEM Don't get me wrong, I definitely respect the capabilities of SSIS, especially with being able to do parallel tasks and write custom C# / VB.Net tasks, but unless you work with SSIS regularly and are hence familiar with its nuances, it is a steep learning curve (and hence very time consuming) that is usually not worth it if a project will be the only thing to use it. At least that has been my experience. It has not been easy to maintain / manage those projects. – Solomon Rutzky Feb 06 '15 at 21:45
  • I agree with the steep learning curve too. A great tool but very complex. When we interview, we ask what they don't like about SSIS, that answer will tell us if they have ever really used it! Anybody who says "nothing" is assumed to be a newbie with the product. – HLGEM Feb 06 '15 at 22:11
  • @HLGEM That's a good and clever technique. I like it. I do something similar, in a more general sense, by asking candidates what they think of cursors and/or triggers. The response usually not only indicates that they have experience with the issue and have actually thought about it vs repeating what they have read, but also how well they can communicate. – Solomon Rutzky Feb 09 '15 at 21:54
  • I could talk to you for hours about triggers and cursors. The good, the bad, the use cases, the places to avoid using them, the real-life examples of things that have gone wrong (can you say cursor in a trigger?) and the real-life examples of things that went right, how to debug the trigger, etc. The worst things that can happen if you avoid using a trigger when it is appropriate, etc. I'd never get out of the interview. – HLGEM Feb 09 '15 at 22:00

The best way to do this, in my opinion, is to create a temp table, use SqlBulkCopy to insert into that temp table (https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy%28v=vs.110%29.aspx), and then simply update the target table based on the temp table.

Based on my tests (using Dapper and also LINQ), updating as one bulk operation or in batches takes far longer than just creating a temp table and sending a single command to the server to update the data based on that temp table. The process is faster because SqlBulkCopy loads the data natively and quickly, and the rest is completed on the SQL Server side, which involves fewer steps since the data already resides on the server at that point.
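
Roughly, the pattern looks like this; the temp table, target table, and column names below are only examples:

```csharp
using System.Data;
using System.Data.SqlClient;

internal static class TempTableUpdateSample
{
    public static void BulkUpdate(string connectionString, DataTable parsedRows)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // 1. Create a temp table on this connection (schema is just an example).
            using (var create = new SqlCommand(
                "CREATE TABLE #RecordUpdates (RecordId INT PRIMARY KEY, NewValue NVARCHAR(100));", conn))
            {
                create.ExecuteNonQuery();
            }

            // 2. Bulk copy the parsed rows into the temp table
            //    (columns map by ordinal position by default).
            using (var bulk = new SqlBulkCopy(conn))
            {
                bulk.DestinationTableName = "#RecordUpdates";
                bulk.WriteToServer(parsedRows);
            }

            // 3. One set-based UPDATE against the real table (dbo.Records is an example name).
            using (var update = new SqlCommand(@"
                UPDATE r
                   SET r.SomeColumn = u.NewValue
                  FROM dbo.Records AS r
                  JOIN #RecordUpdates AS u ON u.RecordId = r.RecordId;", conn))
            {
                update.ExecuteNonQuery();
            }
        }   // the temp table goes away when the connection is closed
    }
}
```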

Hozikimaru