
I "inherited" some legacy software for manipulation of large text files which is currently written in Visual Foxpro (version 7 specifically).

I'm exploring options for replacing these programs with something new that will be easier to modify and maintain. However, I'm unable to get close to FoxPro's performance when importing large text files.

For example, I'm testing with a ~1 GB text file of ~110,000 records, each with 213 fields (each row is 9,247 characters long).

Using FoxPro:

? SECONDS()
USE new
APPEND FROM test.dat TYPE sdf
USE
? SECONDS()

This import completes in just under 10 seconds on my computer.

Using C#/SQLite (with FileHelpers and System.Data.SQLite), the quickest I've been able to get this import to complete is just over a minute. I've tried to optimize it using suggestions from this question (transactions, etc.). Realistically, a minute to import a 1 GB file wouldn't seem bad if I weren't comparing it to FoxPro's 10 seconds.

The approximate breakdown of time spent during the ~1 minute is:

Filehelpers reading file and parsing: 12 seconds.
Building SQL commands from Filehelpers object: 15 seconds.
Running individual ExecuteNonQuery()'s: 24 seconds.
Committing transactions (every 5000 records): 12 seconds.

Compared to the linked thread, my inserts per second are much slower, but my records have 213 fields vs. 7, so that's expected. Broken down by fields per second, I'm at roughly 360,000 vs. the thread's 630,000. By volume, the other poster was inserting about 2.24 MB/s while I'm at about 15.4 MB/s. So I think my performance is comparable to the other poster's, and there probably isn't a ton more optimization I can do.

Why is this so much slower than the FoxPro import (5-6x)? Is this just apples to oranges, and should I simply accept the slower speed as the trade-off for the other benefits of using newer tech?

EDIT:

Here is some test code I've run with random data that shows similar speeds... am I implementing the transactions correctly? The ExecuteNonQuery() calls take a significant amount of time. I've built all of the SQL commands with simple, non-parameterized, probably inefficient string manipulation, but at this point I'm not overly concerned with that (I'll be reading from files, and this approach was much quicker than using Entity Framework).

using System;
using System.Collections.Generic;
using System.Data.SQLite;
using System.Diagnostics;

public class Program
{
    public static void Main(string[] args)
    {
        TestSqlite(100, 20000);
    }

    public static void TestSqlite(int numColumns, int numRows)
    {
        Console.WriteLine("Starting Import.... columns: " + numColumns.ToString() + " rows: " + numRows.ToString());

        var conn = new SQLiteConnection(@"Data Source=C:\vsdev\SqliteTest\src\SqliteTest\ddic.db;Version=3");
        conn.Open();
        var cmd = new SQLiteCommand(conn);
        cmd.CommandText = "DROP TABLE IF EXISTS test";
        cmd.ExecuteNonQuery();

        string createCmd = "CREATE TABLE 'test'('testId' INTEGER PRIMARY KEY  AUTOINCREMENT  NOT NULL, ";

        for (var i = 0; i < numColumns; i++)
        {
            createCmd += "'test" + i.ToString() + "' TEXT, ";
        }

        createCmd = createCmd.Substring(0, createCmd.Length - 2);
        createCmd += ")";
        cmd.CommandText = createCmd;
        cmd.ExecuteNonQuery();

        Stopwatch stopWatch = new Stopwatch();
        stopWatch.Start();

        var transaction = conn.BeginTransaction();


        int lineCount = 0;
        long startTime;
        long endTime;
        long totalStringTime = 0;
        long totalAddTime = 0;
        long totalSaveTime = 0;
        string command;
        for (var l = 0; l < numRows; l++)
        {


            startTime = stopWatch.ElapsedMilliseconds;
            command = CreateRandomInsert("test", numColumns);
            endTime = stopWatch.ElapsedMilliseconds;
            totalStringTime += endTime - startTime;


            // Execute the insert for this row
            startTime = stopWatch.ElapsedMilliseconds;
            cmd.CommandText = command;
            cmd.ExecuteNonQuery();
            endTime = stopWatch.ElapsedMilliseconds;
            totalAddTime += endTime - startTime;


            if (lineCount > 5000)
            {
                lineCount = 0;
                startTime = stopWatch.ElapsedMilliseconds;

                transaction.Commit();
                transaction.Dispose();
                transaction = conn.BeginTransaction();
                cmd = new SQLiteCommand(conn);
                endTime = stopWatch.ElapsedMilliseconds;
                totalSaveTime += endTime - startTime;
                Console.Write('.');
            }
            lineCount += 1;

        }

        startTime = stopWatch.ElapsedMilliseconds;
        transaction.Commit();
        transaction.Dispose();
        endTime = stopWatch.ElapsedMilliseconds;
        totalSaveTime += endTime - startTime;


        Console.WriteLine('.');
        Console.WriteLine("String time: " + totalStringTime.ToString());
        Console.WriteLine("ExecuteNonQuery time: " + totalAddTime.ToString() + ", per 1000 records: " + (totalAddTime / (numRows / 1000)).ToString());
        Console.WriteLine("Commit time: " + totalSaveTime.ToString() + ", per 1000 records: " + (totalSaveTime / (numRows / 1000)).ToString());


        stopWatch.Stop();
        TimeSpan ts = stopWatch.Elapsed;
        string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
            ts.Hours, ts.Minutes, ts.Seconds,
            ts.Milliseconds / 10);
        Console.WriteLine(" in " + elapsedTime);

        conn.Close();

    }


    public static string CreateRandomInsert(string TableName, int numColumns)
    {

        List<string> nameList = new List<string>();
        List<string> valueList = new List<string>();

        for (var i = 0; i < numColumns; i++)
        {
            nameList.Add("test" + i.ToString());
            valueList.Add(Guid.NewGuid().ToString());
        }
        return CreateSql(TableName, nameList, valueList);
    }
    public static string CreateSql(string TableName, List<string> names, List<string> values)
    {
        string textCommand = "";

        textCommand += "INSERT INTO " + TableName + " (";

        foreach (var nameVal in names)
        {
            textCommand += nameVal + ", ";
        }
        textCommand = textCommand.Substring(0, textCommand.Length - 2);
        textCommand += ") VALUES (";
        foreach (var val in values)
        {
            textCommand += "'" + val + "', ";
        }
        textCommand = textCommand.Substring(0, textCommand.Length - 2);
        textCommand += ");";

        return textCommand;
    }
}
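
For what it's worth, here is a rough sketch of how I think the same loop would look with a single parameterized command reused for every row (the class and method names here are just placeholders I made up, and I haven't benchmarked this against the string-building version above). It avoids rebuilding the SQL text per row and lets SQLite reuse the prepared statement:

using System;
using System.Data.SQLite;
using System.Linq;

public static class ParameterizedImport
{
    // Sketch only: one prepared INSERT, values bound per row, single transaction.
    public static void InsertRandomRows(SQLiteConnection conn, string tableName, int numColumns, int numRows)
    {
        // Build "INSERT INTO test (test0, ...) VALUES (@p0, ...)" once.
        var names = string.Join(", ", Enumerable.Range(0, numColumns).Select(i => "test" + i));
        var parms = string.Join(", ", Enumerable.Range(0, numColumns).Select(i => "@p" + i));
        var sql = "INSERT INTO " + tableName + " (" + names + ") VALUES (" + parms + ")";

        using (var transaction = conn.BeginTransaction())
        using (var cmd = new SQLiteCommand(sql, conn, transaction))
        {
            // Create the parameters once; only their values change per row.
            for (var i = 0; i < numColumns; i++)
                cmd.Parameters.Add(new SQLiteParameter("@p" + i));

            for (var row = 0; row < numRows; row++)
            {
                for (var i = 0; i < numColumns; i++)
                    cmd.Parameters[i].Value = Guid.NewGuid().ToString();
                cmd.ExecuteNonQuery();
            }
            transaction.Commit();
        }
    }
}

I've also seen PRAGMA journal_mode and PRAGMA synchronous mentioned for bulk loads, but I haven't measured their effect here.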
playsted

1 Answer


With fixed-width data, VFP has an advantage over other databases: at a low level, it is almost the same format in which VFP stores its own data. For example, if none of your fields are incompatible with FoxPro 2.x, importing that data essentially just means writing a 0x20 byte in front of each row (plus writing the file header if it wasn't written already) and updating the indexes, if any; the index maintenance would be the really time-consuming part. If the lines actually had to be parsed, VFP would be quite slow at string operations compared to C#.
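
Just to illustrate the idea (this is a conceptual sketch, not what VFP literally does, and it ignores the header bookkeeping and indexes): appending an SDF row to a DBF-style file is essentially a byte copy with a one-byte deletion flag in front, something like:

using System.IO;
using System.Text;

class SdfAppendSketch
{
    // Conceptual sketch: why a fixed-width (SDF) append can run at near disk speed.
    // Each DBF record is just a deletion flag byte (0x20 = active) followed by the
    // raw fixed-width field bytes; no per-field parsing or conversion is needed.
    static void AppendRows(string sdfPath, string dbfPath)
    {
        using (var reader = new StreamReader(sdfPath))
        using (var output = new FileStream(dbfPath, FileMode.Append))
        {
            string line;
            while ((line = reader.ReadLine()) != null)       // one fixed-width row
            {
                output.WriteByte(0x20);                       // deletion flag: space = active
                var bytes = Encoding.ASCII.GetBytes(line);    // field bytes copied as-is
                output.Write(bytes, 0, bytes.Length);
            }
        }
        // A real append would also update the record count in the .dbf header and
        // maintain any index files, which is the genuinely time-consuming part.
    }
}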

(I didn't know about FileHelpers; thanks for pointing it out.)

9,247 bytes of fixed-width data across 213 fields is interesting. I would like to test insertion into SQLite, PostgreSQL, SQL Server, and some NoSQL databases. Could you share an empty DBF (or just the structure as code or XML, maybe with 1-2 sample rows from the file)? I haven't tried this with 200+ fields before; otherwise, one minute for 110K rows sounds slow.

Also, a VFP table would only be able to absorb about two appends of that data before hitting the 2 GB table size limit, and you would need to create another table. You say the programs do "manipulation" of text files, but you haven't said what kind of manipulation. Maybe direct "manipulation" of the text, without appending it to a table at all, would be more efficient and let the whole process finish faster? Just a thought.

Cetin Basoz
  • Great info Cetin. I figured there must be something fundamentally easier about what Foxpro was doing. I'll try to put together a sample file for you. If you think that the insert rate is slow I'll continue playing around with settings in order to optimize it as well. – playsted Jan 10 '16 at 17:20
  • FYI I've added code with random data showing similar insert speeds. – playsted Jan 11 '16 at 04:54