
I have a problem. There are hundreds of CSV files, each with about 1,000,000 lines. I need to move that data into the database in a specific way, but the script is working very slowly (it processes only a few tens of thousands of rows per hour).

My code:

    import sqlite3 as lite
    import csv
    import os



    my_file = open('file.csv', 'r')
    reader = csv.reader(my_file, delimiter=',')

    date = '2014-09-29'

    con = lite.connect('test.db', isolation_level = 'exclusive')
    for row in reader:

        position = row[0]
        item_name = row[1]

        cur = con.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS [%s] (Date TEXT, Position INT)" % item_name)
        cur.execute("INSERT INTO [%s] VALUES(?, ?)" % item_name, (date, position))
    con.commit()

I found information about isolation_level and exclusive access to the database, but it didn't help much.

Lines in the CSV files have the structure: 1,item1 | 2,item2

Could anyone help me? Thanks!

Damien

4 Answers


Don't do SQL inserts. Prepare the CSV file first, then do:

.separator <separator>
.import <loadFile> <tableName>

See here: http://cs.stanford.edu/people/widom/cs145/sqlite/SQLiteLoad.html
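
This approach feeds the data to SQLite's command-line shell instead of issuing row-by-row INSERTs from Python. A rough sketch of the idea (my own illustration, not the answer's code), assuming the 1,item1 row layout from the question: split the source CSV into one load file per item, then pipe the matching .import commands into the sqlite3 CLI.

import csv
import subprocess
from collections import defaultdict

date = '2014-09-29'

# group the rows by item name; each item becomes its own table
rows_by_item = defaultdict(list)
with open('file.csv', newline='') as src:
    for position, item_name in csv.reader(src):
        rows_by_item[item_name].append((date, position))

# write one load file per table and collect the shell commands for sqlite3
script_lines = ['.separator ,']
for item_name, rows in rows_by_item.items():
    load_file = item_name + '.load.csv'
    with open(load_file, 'w', newline='') as out:
        csv.writer(out).writerows(rows)
    script_lines.append('CREATE TABLE IF NOT EXISTS [%s] (Date TEXT, Position INT);' % item_name)
    script_lines.append('.import %s %s' % (load_file, item_name))

# run the whole script in one sqlite3 invocation
subprocess.run(['sqlite3', 'test.db'], input='\n'.join(script_lines) + '\n',
               text=True, check=True)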

fukanchik
  • And how can this create the dynamic table names the OP wants? – jsbueno Mar 09 '15 at 18:49
  • @fukanchik it's very important to keep the given structure: table names: item1 | item2 | item3; rows in tables: 2014-03-15, 1 | 2014-03-16, 121 – Damien Mar 09 '15 at 18:56
  • Create a temporary file with the structure suitable for .import, copy the data there from your CSVs, and run the import. – fukanchik Mar 09 '15 at 19:00

You certainly don't want to create a new cursor object for each row you insert, and checking for table creation on every line will certainly slow you down as well.

I'd suggest doing this in two passes: on the first pass you create the needed tables, and on the second pass you record the data. If it is still slow, you could build a more sophisticated in-memory collection of the data to be inserted and perform "executemany" - but that requires grouping the data by item name in memory before committing.

import sqlite3 as lite
import csv
import os

my_file = open('file.csv', 'r')
reader = csv.reader(my_file, delimiter=',')

date = '2014-09-29'

con = lite.connect('test.db', isolation_level = 'exclusive')
cur = con.cursor()

table_names = set(row[1] for row in reader)
my_file.seek(0)

for name in table_names:
    cur.execute("CREATE TABLE IF NOT EXISTS [%s] (Date TEXT, Position INT)" % name)

for row in reader:

    position = row[0]
    item_name = row[1]

    cur.execute("INSERT INTO [%s] VALUES(?, ?)" % item_name, (date, position))

con.commit()
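
For reference, a sketch of the "executemany" variant mentioned above, grouping the rows by item name in memory first (my own illustration, not part of the original answer):

import csv
import sqlite3 as lite
from collections import defaultdict

date = '2014-09-29'

# one pass over the CSV, collecting the rows for each target table
rows_by_item = defaultdict(list)
with open('file.csv', newline='') as my_file:
    for position, item_name in csv.reader(my_file):
        rows_by_item[item_name].append((date, position))

con = lite.connect('test.db')
cur = con.cursor()
for item_name, rows in rows_by_item.items():
    cur.execute("CREATE TABLE IF NOT EXISTS [%s] (Date TEXT, Position INT)" % item_name)
    cur.executemany("INSERT INTO [%s] VALUES(?, ?)" % item_name, rows)
con.commit()
con.close()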
jsbueno

The code is inefficient in that it performs two SQL statements for each row of the CSV. Try to optimize:

  1. Is there a way to process the CSV first and convert it to SQL statements?
  2. Are the rows in the CSV grouped by table (item name)? If yes, you can accumulate the rows to be inserted into the same table (generate a set of INSERT statements for the same table) and prefix the resulting set of statements with CREATE TABLE IF NOT EXISTS only once, not before every one of them.
  3. If possible, use bulk insert. If I get it right, multi-row INSERT was introduced in SQLite v3.7.11. More on this: Is it possible to insert multiple rows at a time in an SQLite database?
  4. If needed, do the bulk insert in chunks (see the sketch after this list). More on this: Bulk insert huge data into SQLite using Python
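
A sketch combining points 3 and 4 (my own illustration; the chunk size and names are arbitrary), building multi-row INSERT ... VALUES statements while staying under SQLite's default limit on bound parameters:

import csv
import sqlite3 as lite
from collections import defaultdict

CHUNK = 400          # 400 rows x 2 columns = 800 parameters, below the old 999 default limit
date = '2014-09-29'

# group rows by item name, one table per item
grouped = defaultdict(list)
with open('file.csv', newline='') as f:
    for position, item_name in csv.reader(f):
        grouped[item_name].append((date, position))

con = lite.connect('test.db')
cur = con.cursor()
for item_name, rows in grouped.items():
    cur.execute("CREATE TABLE IF NOT EXISTS [%s] (Date TEXT, Position INT)" % item_name)
    for i in range(0, len(rows), CHUNK):
        chunk = rows[i:i + CHUNK]
        placeholders = ",".join(["(?, ?)"] * len(chunk))
        flat_params = [value for row in chunk for value in row]
        cur.execute("INSERT INTO [%s] VALUES %s" % (item_name, placeholders), flat_params)
con.commit()
con.close()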
Valera Grishin

I had the same problem. Now it is solved! I would like to share the method with everyone who is facing the same problem!

I use an sqlite3 database as an example; other databases may also work, but I am not sure. The approach uses the pandas and sqlite3 modules in Python.

This can convert a list of csv files [file1,file2,...] into tables [table1,table2,...] quickly.

import pandas as pd
import sqlite3 as sql


DataBasePath = "C:\\Users\\...\\database.sqlite"
conn = sql.connect(DataBasePath)

filePath = "C:\\Users\\...\\filefolder\\"
datafiles = ["file1", "file2", "file3", ...]

for f in datafiles:
    # read each CSV into a DataFrame and write it to a table with the same name
    df = pd.read_csv(filePath + f + ".csv")
    df.to_sql(name=f, con=conn, if_exists='append', index=False)
conn.close()

What's more, this code can create the database if it doesn't exist. The if_exists argument of to_sql() is important: its default value "fail" raises an error if the table already exists; "replace" drops the table first if it exists, then creates a new table and imports the data; "append" inserts the data into the table if it exists, otherwise creates a new table and imports the data.
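
A minimal illustration of the three if_exists modes (the file and table names here are hypothetical):

import pandas as pd
import sqlite3 as sql

conn = sql.connect("database.sqlite")
df = pd.read_csv("file1.csv")

df.to_sql(name="file1", con=conn, if_exists="fail", index=False)     # error if table "file1" already exists
df.to_sql(name="file1", con=conn, if_exists="replace", index=False)  # drop the table and recreate it from df
df.to_sql(name="file1", con=conn, if_exists="append", index=False)   # add rows, creating the table if needed
conn.close()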

Nae