
I have a file (which has 10 million records) like below:

    line1
    line2
    line3
    line4
    .......
    ......
    10 million lines

So basically, I want to insert 10 million records into the database, so I read the file and upload it to SQL Server.

C# code

string line;
System.IO.StreamReader file =
    new System.IO.StreamReader(@"c:\test.txt");
while ((line = file.ReadLine()) != null)
{
    // insertion code goes here
    //DAL.ExecuteSql("insert into table1 values(" + line + ")");
}

file.Close();

But inserting them one at a time takes a very long time. How can I insert 10 million records in the shortest time possible using C#?

Update 1:
Bulk INSERT:

BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
    ROWTERMINATOR = ' \n'
);

My Table is like below:

DATAs
(
     DatasField VARCHAR(MAX)
)

but I am getting following error:

Msg 4866, Level 16, State 1, Line 1
The bulk load failed. The column is too long in the data file for row 1, column 1. Verify that the field terminator and row terminator are specified correctly.

Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.

Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".

The code below worked:

BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n'
);
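
For completeness, since the question asks about C#: a minimal sketch (not from the question itself) of kicking off the working BULK INSERT statement above from C#. The path must be visible to the SQL Server service account, and the connection string is a placeholder.

using System.Data.SqlClient;

public static void BulkInsertFromCSharp(string connectionString)
{
    // The same statement that worked when run directly in SSMS.
    const string sql = @"
BULK INSERT DBNAME.dbo.DATAs
FROM 'F:\dt10000000\dt10000000.txt'
WITH
(
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n'
);";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(sql, connection))
    {
        command.CommandTimeout = 0; // a 10-million-row load can exceed the default 30 seconds
        connection.Open();
        command.ExecuteNonQuery();
    }
}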
MD TAHMID HOSSAIN
  • Can't you import this straight with SQL Server import tools? Why use C# for this? – gjvdkamp Sep 10 '14 at 16:10
  • Uh, take a look at BULK INSERT, BCP, SqlBulkCopy, etc. The absolute worst possible way you could do this is insert 10 million rows, one at a time, in a loop. – Aaron Bertrand Sep 10 '14 at 16:10
  • Have you tried `BULK INSERT` - http://msdn.microsoft.com/en-gb/library/ms188365.aspx? – Kami Sep 10 '14 at 16:11
  • Using XML I can do it without turning on BULK INSERT – MethodMan Sep 10 '14 at 16:15
  • @AaronBertrand worse than 1 at a time from a GUI? :) – Daniel E. Sep 10 '14 at 16:17
  • Create an SSIS package, or use the SqlBulkCopy class from the System.Data.SqlClient namespace. – kidshaw Sep 10 '14 at 16:17
  • Don't you have any separator per line? Is it really one very long line to insert into only one column? If so, then perhaps make your column a `varchar(max)`. What is the structure of the table into which you want to insert the lines, and what do your lines look like? – Will Marcouiller Sep 10 '14 at 16:43
  • @WillMarcouiller The separator per line is a new line. My table looks like table(data varchar(max)). line1 looks like "something", line2 looks like "something2". – MD TAHMID HOSSAIN Sep 10 '14 at 17:01
  • @MDTAHMIDHOSSAIN Keep in mind that there can be other factors that affect how fast you can read/write data, for instance if it's an old SATA2 HD, among other things. – Prix Sep 10 '14 at 17:10

4 Answers


Please do not create a DataTable to load via BulkCopy. That is an ok solution for smaller sets of data, but there is absolutely no reason to load all 10 million rows into memory before calling the database.

Your best bet (outside of BCP / BULK INSERT / OPENROWSET(BULK...)) is to stream the contents from the file into the database via a Table-Valued Parameter (TVP). By using a TVP you can open the file, read a row & send a row until done, and then close the file. This method has a memory footprint of just a single row. I wrote an article, Streaming Data Into SQL Server 2008 From an Application, which has an example of this very scenario.

A simplistic overview of the structure is as follows. I am assuming the same import table and field name as shown in the question above.

Required database objects:

-- First: You need a User-Defined Table Type
CREATE TYPE ImportStructure AS TABLE (Field VARCHAR(MAX));
GO

-- Second: Use the UDTT as an input param to an import proc.
--         Hence "Table-Valued Parameter" (TVP)
CREATE PROCEDURE dbo.ImportData (
   @ImportTable    dbo.ImportStructure READONLY
)
AS
SET NOCOUNT ON;

-- maybe clear out the table first?
TRUNCATE TABLE dbo.DATAs;

INSERT INTO dbo.DATAs (DatasField)
    SELECT  Field
    FROM    @ImportTable;

GO

C# app code to make use of the above SQL objects is below. Notice how, rather than filling up an object (e.g. DataTable) and then executing the Stored Procedure, in this method it is the executing of the Stored Procedure that initiates the reading of the file contents. The input parameter of the Stored Proc isn't a variable; it is the return value of a method, GetFileContents. That method is called when the SqlCommand calls ExecuteNonQuery, which opens the file, reads a row and sends the row to SQL Server via the IEnumerable<SqlDataRecord> and yield return constructs, and then closes the file. The Stored Procedure just sees a Table Variable, @ImportTable, that can be accessed as soon as the data starts coming over (note: the data does persist for a short time, even if not the full contents, in tempdb).

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;

private static IEnumerable<SqlDataRecord> GetFileContents()
{
   SqlMetaData[] _TvpSchema = new SqlMetaData[] {
      new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max)
   };
   SqlDataRecord _DataRecord = new SqlDataRecord(_TvpSchema);
   StreamReader _FileReader = null;

   try
   {
      _FileReader = new StreamReader("{filePath}");

      // read a row, send a row
      while (!_FileReader.EndOfStream)
      {
         // You shouldn't need to call "_DataRecord = new SqlDataRecord" as
         // SQL Server already received the row when "yield return" was called.
         // Unlike BCP and BULK INSERT, you have the option here to create a string,
         // call ReadLine() into the string, do manipulation(s) / validation(s) on
         // the string, then pass that string into SetString() or discard if invalid.
         _DataRecord.SetString(0, _FileReader.ReadLine());
         yield return _DataRecord;
      }
   }
   finally
   {
      if (_FileReader != null)
      {
         _FileReader.Close();
      }
   }
}

The GetFileContents method above is used as the input parameter value for the Stored Procedure as shown below:

public static void test()
{
   SqlConnection _Connection = new SqlConnection("{connection string}");
   SqlCommand _Command = new SqlCommand("ImportData", _Connection);
   _Command.CommandType = CommandType.StoredProcedure;

   SqlParameter _TVParam = new SqlParameter();
   _TVParam.ParameterName = "@ImportTable";
   _TVParam.TypeName = "dbo.ImportStructure";
   _TVParam.SqlDbType = SqlDbType.Structured;
   _TVParam.Value = GetFileContents(); // return value of the method is streamed data
   _Command.Parameters.Add(_TVParam);

   try
   {
      _Connection.Open();

      _Command.ExecuteNonQuery();
   }
   finally
   {
      _Connection.Close();
   }

   return;
}

Additional notes:

  1. With some modification, the above C# code can be adapted to batch the data in (a rough sketch of this is shown after this list).
  2. With minor modification, the above C# code can be adapted to send in multiple fields (the example shown in the "Streaming Data..." article linked above passes in 2 fields).
  3. You can also manipulate the value of each record in the SELECT statement in the proc.
  4. You can also filter out rows by using a WHERE condition in the proc.
  5. You can access the TVP Table Variable multiple times; it is READONLY but not "forward only".
  6. Advantages over SqlBulkCopy:
    1. SqlBulkCopy is INSERT-only whereas using a TVP allows the data to be used in any fashion: you can call MERGE; you can DELETE based on some condition; you can split the data into multiple tables; and so on.
    2. Due to a TVP not being INSERT-only, you don't need a separate staging table to dump the data into.
    3. You can get data back from the database by calling ExecuteReader instead of ExecuteNonQuery. For example, if there was an IDENTITY field on the DATAs import table, you could add an OUTPUT clause to the INSERT to pass back INSERTED.[ID] (assuming ID is the name of the IDENTITY field). Or you can pass back the results of a completely different query, or both since multiple results sets can be sent and accessed via Reader.NextResult(). Getting info back from the database is not possible when using SqlBulkCopy yet there are several questions here on S.O. of people wanting to do exactly that (at least with regards to the newly created IDENTITY values).
    4. For more info on why it is sometimes faster for the overall process, even if slightly slower on getting the data from disk into SQL Server, please see this whitepaper from the SQL Server Customer Advisory Team: Maximizing Throughput with TVP
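
Regarding note 1, a rough sketch (not from the linked article) of one way to batch the data in is shown below. It assumes the TRUNCATE TABLE statement has been removed from dbo.ImportData (otherwise each batch would wipe out the previous one), and names such as batchSize and BatchedImport are purely illustrative.

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;

public static class BatchedImport
{
    // Streams at most "batchSize" rows from an already-open StreamReader.
    private static IEnumerable<SqlDataRecord> GetFileContents(
        StreamReader fileReader, int batchSize)
    {
        SqlMetaData[] tvpSchema = new SqlMetaData[] {
            new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max)
        };
        SqlDataRecord dataRecord = new SqlDataRecord(tvpSchema);

        int rowsSent = 0;
        while (rowsSent < batchSize && !fileReader.EndOfStream)
        {
            dataRecord.SetString(0, fileReader.ReadLine());
            rowsSent++;
            yield return dataRecord;
        }
    }

    // Opens the file once and executes the proc repeatedly, one batch per call,
    // so each batch is a smaller transaction.
    public static void Import(string filePath, string connectionString)
    {
        using (StreamReader fileReader = new StreamReader(filePath))
        using (SqlConnection connection = new SqlConnection(connectionString))
        {
            connection.Open();

            while (!fileReader.EndOfStream)
            {
                using (SqlCommand command = new SqlCommand("ImportData", connection))
                {
                    command.CommandType = CommandType.StoredProcedure;

                    SqlParameter tvParam = command.Parameters.Add(
                        "@ImportTable", SqlDbType.Structured);
                    tvParam.TypeName = "dbo.ImportStructure";
                    tvParam.Value = GetFileContents(fileReader, 100000);

                    command.ExecuteNonQuery(); // one batch per execution
                }
            }
        }
    }
}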
Solomon Rutzky
  • Take a look at my suggestion, it is more elegant – playful Sep 11 '14 at 15:01
  • @CesarBoucas: your suggestion is definitely interesting and more efficient than pre-loading a DataTable. It does make use of the existing `SqlBulkCopy`, which can be nice, but it can also be limiting. While your method does allow for transformation/validation in the IDataReader class, it doesn't allow for any control at the database, nor does it allow for returning data to the calling process (I just added a note at the bottom about that). – Solomon Rutzky Sep 11 '14 at 15:41
  • @srutzky Let's try to focus on the question scope. It is just about loading data from a file to a table, it doesn't mention data transformation/validation/return and so on. Keep it simple ;) – playful Sep 11 '14 at 17:10
  • @CesarBoucas: My suggestion and the example code is focused on the question as stated. I have only pointed out as additional notes that there are many additional advantages and options. I am not saying that your solution is not appropriate, just that it is at most barely any more elegant, and that one gives up a lot in order to get that marginal increase, if there even is any, in elegance. Though again, there are cases (e.g. not being able to add the DB objects and/or needing a drop-in replacement for a process that already dumps into an import table) where your suggestion is ideal. – Solomon Rutzky Sep 11 '14 at 19:23
  • Thanks @SolomonRutzky, I also used your approach and the fully fancy TVP stuff several times. It worked perfectly for me. – cat_minhv0 Aug 09 '18 at 06:26
  • @SolomonRutzky Sorry for the very late comment, but how can your approach can be done if I am using Dapper? I mean how do I pass the IEnumerable to Dapper's Execute method for example? Thank you. – geeko Apr 20 '20 at 22:06
  • @SolomonRutzky Also, can your approach be used without stored procedures (i.e. streaming TVP's to normal queries directly) ? – geeko Apr 20 '20 at 22:18
  • Hi @geeko. Not sure about Dapper as I don't use it. But yes, you certainly should be able to use TVPs with regular parameterized SQL. In that scenario, the parameter name is required to have the `@` prefix, and you need to fill out the `SqlParameter.TypeName` property (neither are required when calling a stored proc). – Solomon Rutzky Apr 20 '20 at 22:32
  • @SolomonRutzky Thank you for your answer. Just to confirm: using TVPs with normal parameterized queries will also result in streaming the TVPs, and the queries will start executing and seeing the TVPs as soon as the first record is streamed? – geeko Apr 20 '20 at 23:12
  • @geeko It's been a long time since I did the testing and I don't have time to do it now, but it really should be the same. So, to whatever degree the proc runs while the data is loading, the same _should_ be true of the parameterized SQL. But technically I'm not sure if the query or proc executes prior to the data being fully transferred. – Solomon Rutzky Apr 20 '20 at 23:18
  • Can you elaborate on "With some modification, the above C# code can be adapted to batch the data in." ? Isn't the TVP streaming already "batching" when sending to the server? Or were you referring to reading in batches from the input file? – muusbolla Mar 16 '21 at 11:53
  • @muusbolla No, the TVP isn't batching anything; I was indeed referring to reading from the file in batches. The code in my answer opens the file, dumps _all_ of it into the stored proc in one shot, then closes the file. I haven't looked at this in a long time, but I believe I was referring to a pattern where you: open the file outside of `GetFileContents()`, execute the proc in a loop (passing in the fileHandle to `GetFileContents(fileHandle, batchSize)` ) until File.EOF, then close the file. This allows for smaller transactions. I'll try to post an example on my blog soon and link to it here. – Solomon Rutzky Mar 25 '21 at 22:08

In C#, the best solution is to let SqlBulkCopy read the file. To do this, pass an IDataReader directly to the SqlBulkCopy.WriteToServer method. Here is an example: http://www.codeproject.com/Articles/228332/IDataReader-implementation-plus-SqlBulkCopy
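
A rough sketch of how such a reader would be consumed is below. LineFileDataReader is hypothetical, standing in for an IDataReader implementation over the text file like the one in the linked article; the table name is taken from the question.

using System.Data;
using System.Data.SqlClient;

public static void BulkLoad(string filePath, string connectionString)
{
    // "LineFileDataReader" is a placeholder for an IDataReader implementation
    // that yields one row per line of the file (as in the linked article).
    using (IDataReader reader = new LineFileDataReader(filePath))
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.DATAs";
            bulkCopy.BulkCopyTimeout = 0;   // no timeout for a long-running load
            bulkCopy.BatchSize = 10000;     // commit in batches instead of one huge batch
            bulkCopy.WriteToServer(reader); // rows stream from the reader, never all in memory
        }
    }
}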

playful
    This is definitely a good solution. But "best" is a matter of what someone is trying to accomplish. Using `SqlBulkCopy` limits what can be done as it only allows for the data to be inserted into a table. On the other hand, a TVP allows anything to do done with the data before it is inserted, after it is inserted, merged instead of inserted, additional operations to be performed (e.g. truncating the destination table first), etc. Also, as I just added to the bottom of my "additional notes" section, a TVP allows for passing results back to the app whereas that is not possible via `SqlBulkCopy`. – Solomon Rutzky Sep 11 '14 at 16:02
    @srutzky Assuming the question asked was "how to insert 10 million entries in the shortest time possible [C#]?", the solution I presented is the one that best fits the question, in other words, it is the best solution for the proposed question. Because, as you said: SqlBulkCopy only allows for the data to be inserted into a table - and this is exactly the question scope. – playful Sep 11 '14 at 17:01
  • I get what you are saying, but "best" is still to be determined by the OP. It is possible that the question scope is incorrectly stated due to the OP not knowing that what I am describing is possible. I have run into this plenty of times where someone has phrased the question too narrowly but didn't know it at the time. Loading the data in the fastest way might actually be a slower overall solution depending on what is being done once the data is there. Besides, what I have suggested is at most only slightly more complicated than what you are suggesting, yet has a great deal of flexibility. – Solomon Rutzky Sep 11 '14 at 19:12

The best way is a mix between your 1st solution and the 2nd: create a DataTable and, in the loop, add rows to it, then use BulkCopy to upload to the DB in one connection (use this for help with bulk copy).

One other thing to pay attention to: bulk copy is a very sensitive operation where almost any mistake will void the copy. For example, if you declare the column name in the DataTable as "text" and in the DB it's "Text", it will throw an exception. Good luck.
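
A rough sketch of this approach might look like the following; the table and column names are assumed from the question, and note the memory caveat raised in the comment below.

using System.Data;
using System.Data.SqlClient;
using System.IO;

public static void LoadViaDataTable(string filePath, string connectionString)
{
    DataTable table = new DataTable();
    table.Columns.Add("DatasField", typeof(string)); // must match the destination column name exactly

    // This loads the entire file into memory before the first row is sent.
    foreach (string line in File.ReadLines(filePath))
    {
        table.Rows.Add(line);
    }

    using (SqlConnection connection = new SqlConnection(connectionString))
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
    {
        connection.Open();
        bulkCopy.DestinationTableName = "dbo.DATAs";
        bulkCopy.ColumnMappings.Add("DatasField", "DatasField"); // explicit mapping helps avoid name-mismatch surprises
        bulkCopy.WriteToServer(table);
    }
}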

Liran
    Just FYI, this solution is fine for _much_ smaller data sets. The problem is the high memory cost as the data size grows. In the case of 10 million rows, assuming 10 characters per row is 100 MB of data in memory, right? Well, it depends on how the data is stored in memory. This case appears to be only strings which is 2 bytes per character as .Net strings are UCS-2. Hence the 10 million row file takes up 200 MB of memory in the `DataTable` before the first row can be sent to the database. – Solomon Rutzky Sep 11 '14 at 15:18

If you want to insert 10 million records in the shortest time and it is only for testing purposes, you can generate them directly with a SQL query:

CREATE TABLE TestData(ID INT IDENTITY (1,1), CreatedDate DATETIME)
GO

INSERT INTO TestData(CreatedDate) SELECT GetDate()
GO 10000000