
Using C#, I'm trying to insert the rows of a CSV file (~55,000 records) into a PostgreSQL table with 350 columns. I'm building the INSERT statements from the header and each record, which are comma separated, like this:

INSERT INTO public.table(field1,fields2,field3, ..... ,field350) VALUES(?,?,?,?,?,.......,?);
INSERT INTO public.table(field1,fields2,field3, ..... ,field350) VALUES(?,?,?,?,?,.......,?);
INSERT INTO public.table(field1,fields2,field3, ..... ,field350) VALUES(?,?,?,?,?,.......,?);
INSERT INTO public.table(field1,fields2,field3, ..... ,field350) VALUES(?,?,?,?,?,.......,?);
....etc

I tried batching: gathering 1,000 statements and executing them in one transaction, but it still takes up to ~3 seconds per 1,000 records.

I also tried following the example from Microsoft here, in which I call ExecuteNonQuery() on each line, but commit the transaction after every 1,000 records and then begin a new transaction. That also takes up to ~3 seconds per 1,000 records. Something like this:

foreach (var line in ReadCsvFileToList(dataFile))
{
    try
    {
        if (firstLine)
        {
            header = line;
            firstLine = false;
            continue;
        }
        else
        {
            var formattedLine = line.Replace("\"\"", "NULL").Replace("'", "''").Replace("\"", "'");
            var commandText = $"INSERT INTO public.table({header.Replace('/', '_')}) VALUES ({formattedLine})";
            command.CommandText = commandText;
            await command.ExecuteNonQueryAsync();
            round++;
        }

        if (round == 1000) // batch size
        {
            await transaction.CommitAsync();
            Console.WriteLine("batch commited to DB at: " + DateTime.Now.ToString("ddMMyyy hh:mm:ss"));
            round = 0;
            transaction = connection.BeginTransaction();
        }
    }
    catch (Exception)
    {
        // Roll back before closing the connection; awaiting the null-conditional
        // call would throw if the transaction is null.
        if (transaction != null)
        {
            await transaction.RollbackAsync();
        }
        await connection.CloseAsync();
        await connection.DisposeAsync();
        throw;
    }
}

Any ideas on how to optimize this further?

alaa_sayegh

1 Answer


The quickest way to load .csv files into a DB is to do it in the DB server/DBMS itself.
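For example, if the file can be placed where the PostgreSQL server can read it, a single server-side COPY loads it without the data ever passing through your application. A minimal sketch; the table name, column list and file path are placeholders, and the connecting role needs permission to read files on the server:

using Npgsql;

// Server-side COPY: the server reads the file itself, so only this one
// statement crosses the network. The file must be readable by the
// PostgreSQL server process (e.g. via the pg_read_server_files role).
await using var conn = new NpgsqlConnection(connectionString);
await conn.OpenAsync();

await using var cmd = new NpgsqlCommand(
    "COPY public.table (field1, field2 /* ..., field350 */) " +
    "FROM '/srv/data/import.csv' WITH (FORMAT csv, HEADER true)", conn);
await cmd.ExecuteNonQueryAsync();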

C# will never get close to that performance; at best you add the overhead of transferring all the data over the network once. If you have to do it from code, there is usually a proper bulk-insert mechanism. For PostgreSQL, that is COPY.
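If the file only exists on the client, Npgsql exposes COPY ... FROM STDIN, which streams all rows over one connection instead of issuing 55,000 separate INSERTs. A rough sketch, reusing the header and ReadCsvFileToList pieces from the question; the column list and NULL handling are assumptions:

using System.Linq;
using Npgsql;

await using var conn = new NpgsqlConnection(connectionString);
await conn.OpenAsync();

// header is the first CSV line, e.g. "field1,field2,...,field350"
var copySql = $"COPY public.table ({header.Replace('/', '_')}) FROM STDIN (FORMAT csv, NULL '')";

// BeginTextImport returns a TextWriter; each line written to it is parsed
// by the server as one CSV row, so there is no per-row SQL on the client.
await using (var importer = conn.BeginTextImport(copySql))
{
    foreach (var line in ReadCsvFileToList(dataFile).Skip(1)) // skip the header row
    {
        await importer.WriteLineAsync(line);
    }
} // disposing the writer completes the COPY

Npgsql also has BeginBinaryImport for a typed binary COPY, which is faster still but requires writing every column value individually.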

If that is not an option either, at least use an explicit DB transaction. Doing inserts in a loop makes the DB open and commit a transaction for every single INSERT; wrapping the loop in an explicit transaction can improve performance at least somewhat.
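If COPY really is not available, reuse one prepared, parameterized INSERT inside a single explicit transaction instead of building a 350-column SQL string per row. A rough sketch; the naive Split(',') breaks on quoted commas and all parameters are declared as text, both of which are simplifying assumptions:

using System;
using System.Linq;
using Npgsql;
using NpgsqlTypes;

await using var conn = new NpgsqlConnection(connectionString);
await conn.OpenAsync();

var columns = header.Replace('/', '_').Split(',');
var placeholders = string.Join(",", columns.Select((_, i) => $"@p{i}"));

await using var transaction = await conn.BeginTransactionAsync();
await using var cmd = new NpgsqlCommand(
    $"INSERT INTO public.table ({string.Join(",", columns)}) VALUES ({placeholders})",
    conn, transaction);

for (var i = 0; i < columns.Length; i++)
    cmd.Parameters.Add(new NpgsqlParameter($"p{i}", NpgsqlDbType.Text));
await cmd.PrepareAsync(); // parse/plan the 350-column statement once

foreach (var line in ReadCsvFileToList(dataFile).Skip(1))
{
    var values = line.Split(','); // naive: does not handle quoted commas
    for (var i = 0; i < columns.Length; i++)
        cmd.Parameters[i].Value = string.IsNullOrEmpty(values[i]) ? (object)DBNull.Value : values[i];
    await cmd.ExecuteNonQueryAsync();
}
await transaction.CommitAsync();

Parameters also remove the manual quote escaping from the question, which is both a performance cost and an injection risk.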

Christopher
  • Am I not using an explicit transaction? Any code example? – alaa_sayegh Jun 06 '20 at 21:29
  • @alaa_sayegh The first option is to not do it in code at all. | The second best is to use COPY. Wiktor linked the documentation for it. | So there are no more code examples we could give. – Christopher Jun 06 '20 at 21:50