
How can I insert millions of rows from a .txt file into a SQLite3 database using Flask-SQLAlchemy? I tried reading one line at a time from the .txt file and adding and committing each row in a loop (roughly the approach sketched below), but noticed that this takes an enormous amount of time. How can I do this efficiently? I tried adapting this solution https://stackoverflow.com/a/7137270 to my code but couldn't get it to work.

The table schema looks like this:

class table(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    col1 = db.Column(db.Integer)
    col2 = db.Column(db.Integer)
    col3 = db.Column(db.String(50))
    col4 = db.Column(db.String(50))
    col5 = db.Column(db.String(50))
    col6 = db.Column(db.Integer)
    col7 = db.Column(db.String(50))
    col8 = db.Column(db.Integer)
    col9 = db.Column(db.Integer)
    col10 = db.Column(db.Integer)
    col11 = db.Column(db.Integer)
    col12 = db.Column(db.Integer)

The rows in the .txt file look like this:

hjk;28770930;Y;T;C;;asd;;1;1233;1233;0.00081103

And there are about 85M rows to add to the db.
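
For reference, the per-row add-and-commit loop described above presumably looked something like this (a sketch only; it assumes the `db` object and `table` model from the schema above, and the function name and file path are placeholders):

import csv

# Sketch of the slow approach: one add() and one commit() per row.
# load_rows_one_by_one is a hypothetical name; db and table come from the question.
def load_rows_one_by_one(filename):
    with open('./folder/{}'.format(filename), "r") as f:
        reader = csv.reader(f, delimiter=";")
        for col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12 in reader:
            db.session.add(table(col1=col1, col2=col2, col3=col3, col4=col4,
                                 col5=col5, col6=col6, col7=col7, col8=col8,
                                 col9=col9, col10=col10, col11=col11, col12=col12))
            db.session.commit()  # committing every single row is what makes this so slow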

  • Possible duplicate of [Is it possible to insert multiple rows at a time in an SQLite database?](https://stackoverflow.com/questions/1609637/is-it-possible-to-insert-multiple-rows-at-a-time-in-an-sqlite-database) – Fine Jun 25 '18 at 12:46
  • Plus, pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) + [to_sql](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) are awesome (see the sketch after these comments). – Fine Jun 25 '18 at 12:48
  • Just remove all those `commit()` calls, and do one `commit()` at the end. – CL. Jun 25 '18 at 21:14
  • @CL. I tried this as well but it ran for over an hour and didn't reach the commit. –  Jun 26 '18 at 06:18
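
A minimal sketch of the pandas approach suggested in the comment above, assuming the file lives at ./folder/data.txt (a placeholder) and that the model's table is named "table" (Flask-SQLAlchemy's default name for the class above); chunksize keeps memory use bounded:

import pandas as pd

# Column names must line up with the model's columns.
cols = ["col1", "col2", "col3", "col4", "col5", "col6",
        "col7", "col8", "col9", "col10", "col11", "col12"]

# Read the semicolon-delimited file in chunks and append each chunk
# to the existing table through the Flask-SQLAlchemy engine.
for chunk in pd.read_csv("./folder/data.txt", sep=";", header=None,
                         names=cols, chunksize=100000):
    chunk.to_sql("table", db.engine, if_exists="append", index=False)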

1 Answer


I found one solution that makes the transaction significantly faster. I used the answers from https://stackoverflow.com/a/7137270/9988919 and https://stackoverflow.com/a/32271651/9988919.

Instead of reading one line at a time and writing one row on every iteration, I use the chunks() function to divide the file into chunks of 10000 rows and yield them from a generator. The load_data_table() function then loops over the chunks and commits after every chunk.

I would still be interested to hear if anyone can figure out a faster way, because this approach still takes approximately 5 hours (one possibility is sketched after the code below).

Here is my code:

import csv

def chunks(data, n=10000):
    # Collect records from the iterable into lists of at most n rows
    # and yield them one chunk at a time.
    buffer = [None] * n
    idx = 0
    for record in data:
        buffer[idx] = record
        idx += 1
        if idx == n:
            yield buffer
            buffer = [None] * n
            idx = 0
    if idx > 0:
        yield buffer[:idx]

def load_data_table(filename):
    rows = 0
    csvData = csv.reader(open('./folder/{}'.format(filename), "r"), delimiter=";")
    divData = chunks(csvData)  # divide into chunks of 10000 rows each

    for chunk in divData:
        for col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12 in chunk:
            add_new = table(col1=col1, col2=col2, col3=col3, col4=col4, col5=col5, col6=col6,
                            col7=col7, col8=col8, col9=col9, col10=col10, col11=col11, col12=col12)
            db.session.add(add_new)
        db.session.commit()  # one commit per chunk instead of one per row
        rows += len(chunk)   # the last chunk may hold fewer than 10000 rows
        print(rows)
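
As for a faster way, one untested possibility (an assumption on my part, not verified against the linked answers) is to hand each chunk to SQLAlchemy's bulk_insert_mappings(), which skips most of the per-object ORM overhead. A sketch reusing the chunks() generator above:

COLS = ["col1", "col2", "col3", "col4", "col5", "col6",
        "col7", "col8", "col9", "col10", "col11", "col12"]

def load_data_table_bulk(filename, chunk_size=10000):
    # Hypothetical variant of load_data_table(): build plain dicts and
    # insert each chunk in one bulk operation, committing per chunk.
    rows = 0
    with open('./folder/{}'.format(filename), "r") as f:
        reader = csv.reader(f, delimiter=";")
        for chunk in chunks(reader, n=chunk_size):
            mappings = [dict(zip(COLS, record)) for record in chunk]
            db.session.bulk_insert_mappings(table, mappings)
            db.session.commit()
            rows += len(chunk)
            print(rows)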