23

I am "converting" a large (~1.6GB) CSV file and inserting specific fields of the CSV into a SQLite database. Essentially my code looks like:

import csv, sqlite3

conn = sqlite3.connect("path/to/file.db")
conn.text_factory = str  # allow 8-bit bytestrings
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')

reader = csv.reader(open("filecsv.txt", "rb"))
for field1, field2, field3, field4, field5 in reader:
  cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))
conn.commit()

Everything works as I expect, with one exception: it takes an incredible amount of time to process. Am I coding it incorrectly? Is there a better way to achieve higher performance and accomplish what I need (simply converting a few fields of a CSV into a SQLite table)?

**EDIT: I tried importing the CSV directly into SQLite as suggested, but it turns out my file has commas inside fields (e.g. "My title, comma"), which causes errors on import. There appear to be too many of those occurrences to edit the file manually...

Any other thoughts?**

user735304
  • It's a big file. How long does it take? – Blender May 09 '11 at 20:58
  • How many duplicate records are there? If there are a lot, it would probably be faster to keep a local `set` of records that have already been inserted and skip the SQL call entirely for the duplicates (see the sketch after these comments). – kindall May 09 '11 at 21:06
  • [Here](http://dev.mysql.com/doc/refman/5.5/en/insert-speed.html) are some MySQL bulk load speed tips. – kindall May 09 '11 at 21:52
  • What operating system and Python version are you using? – Cristian Ciupitu May 09 '11 at 23:51
  • "It appears there are too many of those occurrences to manually edit the file...". Let's think. Too many to change manually? If only you had a programming language that would allow you to write a program to reformat a CSV file into a TAB-delimited file. Any ideas what language could be used to write a program like that? – S.Lott May 10 '11 at 01:10
  • The 2020 solution is to use Pandas. Pandas has excellent SQL writers that let you write in chunks. Pandas makes this problem very easy and saves you from having to worry about low-level details. Pandas readers also make it easy to address any weirdness in the CSV file before you write to SQL (saving you from another common headache). See my answer for more detail. – Powers Sep 18 '20 at 20:46
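Regarding kindall's suggestion of a local `set`: here is a minimal sketch of that idea, reusing the code from the question. The choice of `(field2, field4)` as the dedup key, and everything not taken from the question, is an assumption.

import csv, sqlite3

conn = sqlite3.connect("path/to/file.db")
conn.text_factory = str
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')

seen = set()                                   # (field2, field4) pairs already inserted
with open("filecsv.txt", "rb") as f:           # Python 2 mode, as in the question
    for field1, field2, field3, field4, field5 in csv.reader(f):
        key = (field2, field4)
        if key in seen:
            continue                           # skip the SQL call entirely for duplicates
        seen.add(key)
        cur.execute('INSERT INTO mytable (field2, field4) VALUES (?,?)', key)
conn.commit()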

5 Answers

27

Chris is right - use transactions; divide the data into chunks and then store it.

"... Unless already in a transaction, each SQL statement has a new transaction started for it. This is very expensive, since it requires reopening, writing to, and closing the journal file for each statement. This can be avoided by wrapping sequences of SQL statements with BEGIN TRANSACTION; and END TRANSACTION; statements. This speedup is also obtained for statements which don't alter the database." - Source: http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html

"... there is another trick you can use to speed up SQLite: transactions. Whenever you have to do multiple database writes, put them inside a transaction. Instead of writing to (and locking) the file each and every time a write query is issued, the write will only happen once when the transaction completes." - Source: How Scalable is SQLite?

import csv, sqlite3, time

def chunks(data, rows=10000):
    """ Divides the data into chunks of `rows` rows each """

    for i in xrange(0, len(data), rows):
        yield data[i:i+rows]


if __name__ == "__main__":

    t = time.time()

    conn = sqlite3.connect("path/to/file.db")
    conn.text_factory = str      # allow 8-bit bytestrings
    conn.isolation_level = None  # autocommit mode: we issue BEGIN/COMMIT ourselves below
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')

    # read everything into a list so len() and slicing work in chunks()
    csvData = list(csv.reader(open("filecsv.txt", "rb")))

    divData = chunks(csvData)  # divide into 10000 rows each

    for chunk in divData:
        cur.execute('BEGIN TRANSACTION')

        for field1, field2, field3, field4, field5 in chunk:
            cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))

        cur.execute('COMMIT')

    print "\n Time Taken: %.3f sec" % (time.time()-t)
Sam
  • Another user following this code ran into a problem trying to use `len()` with their CSV reader: http://stackoverflow.com/questions/18062694/sqlite-transaction-for-csv-importing/18063276#18063276 – rutter Aug 05 '13 at 16:39
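As the comment above notes, `len()` doesn't work on a bare `csv.reader` object, which is why the version above materializes the rows into a list first. If holding the whole ~1.6GB file in memory is a concern, the chunking can be done lazily instead. A sketch under that assumption (the `chunked` helper and the use of `executemany` are my own, not part of the original answer):

import csv, sqlite3
from itertools import islice

def chunked(iterable, size=10000):
    """ Yield successive lists of up to `size` rows without loading everything into memory """
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            break
        yield chunk

conn = sqlite3.connect("path/to/file.db")
conn.text_factory = str
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')

with open("filecsv.txt", "rb") as f:       # Python 2 mode, as in the question
    for chunk in chunked(csv.reader(f)):
        # columns 1 and 3 correspond to field2 and field4 in the question
        cur.executemany('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)',
                        [(row[1], row[3]) for row in chunk])
        conn.commit()                      # end the transaction for this chunk

Each `conn.commit()` ends the transaction the sqlite3 module opened implicitly for the chunk, so you still get one transaction per 10,000 rows.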
26

It's possible to import the CSV directly:

sqlite> .separator ","
sqlite> .import filecsv.txt mytable

http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles

fengb
  • There doesn't seem to be a built-in way to handle escaping by default, and the quotes end up as literal characters in the stored strings. It might make sense to rewrite the text with a CSV parser and output it with a different separator (see the sketch after these comments), though that might defeat the purpose of using the import tool in the first place. – fengb May 10 '11 at 01:32
  • Try `.mode csv` instead of `.separator`, see: http://stackoverflow.com/questions/14947916/import-csv-to-sqlite/24582022#24582022 – NumesSanguis Dec 03 '14 at 14:40
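Following fengb's comment above (and S.Lott's comment under the question), one option is to rewrite the file with a delimiter that does not occur in the data, so the shell's `.import` can cope with fields containing commas. A sketch under that assumption (the pipe delimiter and the `filecsv.psv` output name are my own choices):

import csv

# rewrite the file with a pipe delimiter; csv.reader handles the quoted
# "My title, comma" fields correctly, so the commas stay inside their fields
with open("filecsv.txt", "rb") as src, open("filecsv.psv", "wb") as dst:   # Python 2 modes
    writer = csv.writer(dst, delimiter="|")
    for row in csv.reader(src):
        writer.writerow(row)

Then, in the sqlite3 shell, `.separator "|"` followed by `.import filecsv.psv mytable` should work. Alternatively, as the comment above notes, newer shells accept `.mode csv`, which understands quoted fields directly.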
17

As Chris and Sam have said, transactions greatly improve insert performance.

Let me recommend another option: csvkit, a suite of Python utilities for working with CSV files.

To install:

pip install csvkit

To solve your problem:

csvsql --db sqlite:///path/to/file.db --insert --table mytable filecsv.txt
migonzalvar
3

Try using transactions.

begin    
insert 50,000 rows    
commit

That will commit data periodically rather than once per row.
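A minimal sketch of that pattern with Python's sqlite3 module, using the table and columns from the question (the 50,000-row batch size comes from this answer; the rest is an assumption):

import csv, sqlite3

conn = sqlite3.connect("path/to/file.db")
conn.text_factory = str
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')

with open("filecsv.txt", "rb") as f:                  # Python 2 mode, as in the question
    for i, row in enumerate(csv.reader(f), 1):
        cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)',
                    (row[1], row[3]))
        if i % 50000 == 0:
            conn.commit()                             # commit every 50,000 rows
conn.commit()                                         # commit the final partial batch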

Chris N
0

Pandas makes it easy to load big files into databases in chunks. Read the CSV file into a Pandas DataFrame and then use the Pandas SQL writer (so Pandas does all the hard work). Here's how to read the CSV and write it to the database in 100,000-row batches.

import sqlite3
import pandas as pd

conn = sqlite3.connect('path/to/file.db')

orders = pd.read_csv('path/to/your/file.csv')
orders.to_sql('orders', conn, if_exists='append', index=False, chunksize=100000)

Modern Pandas versions are very performant. Don't reinvent the wheel. See here for more info.
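If reading the whole 1.6GB file into one DataFrame is too much for memory, `read_csv` can also stream it in pieces. A sketch assuming the CSV has a header row with columns named field2 and field4 (those names, and the file paths, are assumptions):

import sqlite3
import pandas as pd

conn = sqlite3.connect('path/to/file.db')

# stream the CSV in 100,000-row pieces instead of loading it all at once
for piece in pd.read_csv('path/to/your/file.csv', usecols=['field2', 'field4'], chunksize=100000):
    piece.to_sql('mytable', conn, if_exists='append', index=False)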

Powers