
I have large CSV files with approximately 164 columns x 50,000 rows and would like to import those files into a PostgreSQL database (12.2).

Using Django 3.0.7 and Python 3.6.8, I created a model DataRow which has one field for each column. When importing a file, I first validate it by checking for missing values and so on. Then I create a DataRow object for each row and try to load everything into the database using bulk_create. Here is a stylized, shortened version of my code:

import pandas as pd

data = pd.read_csv(myfile)

validate_my_data(data)

# build one unsaved model instance per CSV row
data_row_objects = []
for _, row in data.iterrows():
    data_row_objects.append(DataRow(**row_to_dict(row)))

# insert everything, 3000 rows per INSERT statement
DataRow.objects.bulk_create(data_row_objects, batch_size=3000)

So far it works. Validating and appending all objects to data_row_objects takes about 8 seconds, which is okay. However, the bulk_create statement takes about 3 minutes, which is not acceptable.

So how can I improve the performance here? I already tried increasing and decreasing the batch_size, but 3000 seems to be the most stable size so far.
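For what it's worth, this is a simplified sketch of the kind of comparison I ran to settle on that value (the timing loop is not part of my real import script; it reuses data and row_to_dict from the snippet above and empties the table before each run):

import time

for batch_size in (1000, 3000, 5000):
    # start from an empty table so each run measures the same amount of work
    DataRow.objects.all().delete()
    objects = [DataRow(**row_to_dict(row)) for _, row in data.iterrows()]
    start = time.perf_counter()
    DataRow.objects.bulk_create(objects, batch_size=batch_size)
    print(batch_size, round(time.perf_counter() - start, 1), "seconds")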

I have also read in some posts here on Stack Overflow that importing a CSV file directly into the database is much faster. However, I would rather stick to pure Django, because in my opinion this is exactly one of the purposes of Django: avoiding direct communication with the database.
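For reference, this is roughly what I understand that direct-import alternative to look like (only a sketch: I am assuming the table is named myapp_datarow, that the CSV column order matches the table columns, and that Django's cursor passes copy_expert through to the underlying psycopg2 cursor):

from django.db import connection

def copy_csv_into_db(path):
    # COPY the already validated CSV straight into the (assumed) table,
    # bypassing the ORM entirely
    with open(path) as f, connection.cursor() as cursor:
        cursor.copy_expert(
            "COPY myapp_datarow FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )

But as I said, I would prefer to avoid this if bulk_create can be made fast enough.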

Or do I have to change settings in my PostgreSQL configuration? Maybe it is slower because Django and PostgreSQL each run in a separate Docker container.

To be honest, it is quite difficult for me to identify the performance bottleneck here, because I am still a beginner when it comes to Django and databases. But I think there has to be a fast and elegant way, because my datasets are not that big at all and importing such files should be a pretty common task.

Looking forward to your answers, Tom

  • I don't know what's happening on the Python side, but there are a couple of ways to speed up inserts: https://stackoverflow.com/questions/12206600/ – Florian Fida Jun 17 '20 at 21:27
  • Postgres also has a built-in way to import CSV files, if you want to take Python/Django out of the equation: https://www.postgresql.org/docs/current/sql-copy.html – Florian Fida Jun 17 '20 at 21:30
