
I am putting together a system that collects data from Quandl and stores it in a database. I should note that there is no commercial aspect to what I am doing (I have no customer/employer). I am doing this as a hobby and to hopefully learn a thing or two.

Anyway, the challenge I have set myself is to build a system that automatically downloads data from Quandl and stores it in a database, without ever saving zip or csv files to disk.

Quandl provides daily 'delta' files which can be downloaded as zip files. The zip files are extracted to csv files. I have managed to get as far as downloading the zip files and extracting the csv files all in memory, using a MemoryStream, ZipArchive, and StreamReader in .NET (F# specifically); a sketch of this step is below.
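
A minimal sketch, assuming a hypothetical delta-file URL (Quandl's real endpoint and API-key handling are omitted):

    // Download a zip into memory, open it, and read the CSV lines,
    // all without touching disk. The URL is a placeholder.
    open System.IO
    open System.IO.Compression
    open System.Net.Http

    let downloadCsvLines (url: string) : string list =
        use http = new HttpClient ()
        // Block on the download for simplicity in this sketch.
        let bytes = http.GetByteArrayAsync(url).Result
        use zipStream = new MemoryStream (bytes)
        use archive = new ZipArchive (zipStream, ZipArchiveMode.Read)
        // Assume the archive holds a single CSV entry.
        use reader = new StreamReader ((Seq.head archive.Entries).Open ())
        [ while not reader.EndOfStream do
            yield reader.ReadLine () ]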

Now the challenge I am stuck on is how to get this data over to my database. The database I am using is MariaDB (which is essentially the same as MySQL), because that is the only type of database my NAS supports.

Options are

  1. Give up on my objective of never saving to disk: save the csv to disk, then pass the file path to a stored procedure as in this answer.
  2. Convert the csv data into JSON or XML and pass it to a stored procedure that parses the string into a temporary table (a client-side sketch follows this list). I have done this before using SQL Server and am assuming something similar is possible here.
  3. Read the csv line by line and pass it to the database line by line. This is really a non-option as it would be very slow.
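
For option 2, the client side of the conversion might look like this minimal sketch, assuming the Newtonsoft.Json package and the parsed CSV rows as string arrays (the stored procedure itself is not shown):

    // Serialise parsed CSV rows as a JSON array of arrays, e.g.
    // [["AAPL","2017-08-16","160.95"], ...], ready to pass to a
    // stored procedure as a single string parameter.
    open Newtonsoft.Json

    let rowsToJson (rows: seq<string[]>) : string =
        JsonConvert.SerializeObject rows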

Seems like 2 is the best option I know of. Is there a more direct way that does not involve converting csv to JSON or XML?

Chechy Levas
    Would anything in this blog post be helpful? https://medium.com/@edgarsanchezg/hey-f-load-me-this-csv-file-into-a-table-14c60a3b0842 – Tomas Petricek Aug 16 '17 at 17:19
  • @TomasPetricek - I've only read the first paragraph and came back to say that I can't believe you managed to find something so specifically on target for my question! I'll read the rest and let you know if any problems, but it looks promising. – Chechy Levas Aug 16 '17 at 17:27

1 Answer

LOAD DATA INFILE will be, by far, the fastest way to go. But it does require you to put the CSV data into a file system. You may be able to set up a temporary, even RAM-backed, file system for this purpose.
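
For example, a minimal sketch, assuming the MySqlConnector package, a hypothetical `prices` table, and AllowLoadLocalInfile=true in the connection string. It spools the in-memory CSV to a temp file (which could live on a RAM-backed file system) and loads it:

    open System.IO
    open MySqlConnector

    let loadCsv (csvStream: Stream) (connectionString: string) =
        // LOAD DATA needs a path, so spool the stream to a temp file first.
        let tmpPath = Path.GetTempFileName ()
        try
            do
                use file = File.Create tmpPath
                csvStream.CopyTo file
            use conn = new MySqlConnection (connectionString)
            conn.Open ()
            use cmd = conn.CreateCommand ()
            // LOCAL makes the client read the file and stream it to the
            // server, so the file only has to exist on the client side.
            cmd.CommandText <-
                sprintf "LOAD DATA LOCAL INFILE '%s' INTO TABLE prices \
                         FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' \
                         LINES TERMINATED BY '\\n' IGNORE 1 LINES"
                    (tmpPath.Replace ("\\", "/"))
            cmd.ExecuteNonQuery () |> ignore
        finally
            File.Delete tmpPath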

In the dotnet world, there's a robust class for reading CSV data from streams, and files are just a special case of streams. For historical reasons it is called Microsoft.VisualBasic.FileIO.TextFieldParser. (It works fine outside Visual Basic; it just carries a name from long ago.)
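
A minimal sketch of reading from a stream with it, in F#:

    open System.IO
    open Microsoft.VisualBasic.FileIO

    // Parse delimited records from any Stream; ReadFields returns one
    // record as an array of column values, handling quoted fields.
    let readRows (csvStream: Stream) : seq<string[]> =
        seq {
            use parser = new TextFieldParser (csvStream)
            parser.TextFieldType <- FieldType.Delimited
            parser.Delimiters <- [| "," |]
            parser.HasFieldsEnclosedInQuotes <- true
            while not parser.EndOfData do
                yield parser.ReadFields ()
        }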

If you use this approach, you can improve performance by inserting multiple rows of the CSV in each transaction. There are two ways to do that.

One is multi-row inserts, like so:

     INSERT INTO tbl 
     (col,col,col)
     VALUES 
     (val, val, val),
     (val, val, val),
     (val, val, val),
     ...
     (val, val, val);
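
A minimal sketch of building such statements, assuming the readRows function above, the MySqlConnector package, and a hypothetical three-column prices table:

    open MySqlConnector

    // Group rows into chunks; each chunk becomes one parameterised
    // multi-row INSERT, so one round trip covers many rows.
    let insertMultiRow (rows: seq<string[]>) (conn: MySqlConnection) =
        for chunk in Seq.chunkBySize 500 rows do
            use cmd = conn.CreateCommand ()
            cmd.CommandText <-
                chunk
                |> Array.mapi (fun i _ -> sprintf "(@t%d, @d%d, @p%d)" i i i)
                |> String.concat ", "
                |> sprintf "INSERT INTO prices (ticker, trade_date, price) VALUES %s"
            chunk
            |> Array.iteri (fun i fields ->
                cmd.Parameters.AddWithValue (sprintf "@t%d" i, fields.[0]) |> ignore
                cmd.Parameters.AddWithValue (sprintf "@d%d" i, fields.[1]) |> ignore
                cmd.Parameters.AddWithValue (sprintf "@p%d" i, fields.[2]) |> ignore)
            cmd.ExecuteNonQuery () |> ignore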

The other is to use START TRANSACTION, then do a couple of hundred inserts, then COMMIT, and repeat until you're done. Experience teaches that this will make your insertion reasonably fast.
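
A minimal sketch of that, with the same assumptions as above:

    // Wrap each batch of single-row inserts in one transaction so the
    // server commits a couple of hundred rows at a time, not one by one.
    let insertBatched (rows: seq<string[]>) (conn: MySqlConnection) =
        for batch in Seq.chunkBySize 200 rows do
            use txn = conn.BeginTransaction ()
            for fields in batch do
                use cmd = conn.CreateCommand ()
                cmd.Transaction <- txn
                cmd.CommandText <-
                    "INSERT INTO prices (ticker, trade_date, price) VALUES (@t, @d, @p)"
                cmd.Parameters.AddWithValue ("@t", fields.[0]) |> ignore
                cmd.Parameters.AddWithValue ("@d", fields.[1]) |> ignore
                cmd.Parameters.AddWithValue ("@p", fields.[2]) |> ignore
                cmd.ExecuteNonQuery () |> ignore
            txn.Commit ()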

Parsing JSON in a MySQL stored procedure? Absurdly hard to debug. And you'll still have to manage the transactions as I mentioned.

O. Jones