
I need to insert hundreds of rows of data into a relational database (PostgreSQL). The data to be inserted is saved in a .csv file and has to go into multiple tables with all kinds of relationships (one-to-one, one-to-many, many-to-many). There are roughly 10+ tables, plus a number of associative entities.

Suppose Table A and Table B have a many-to-many relationship, so there is also a Table AB (the associative entity). During insertion, after the data has been inserted into Table A and Table B, the generated primary keys must be mapped into the associative entity Table AB.

To be more specific, the following code shows how some of my database tables are related (all tables are created using Django models):

class Restaurant(models.Model):
    ...  # other fields omitted
    promotion = models.ForeignKey('Promotion')
    menus = models.ManyToManyField('Menu')  # lazy string reference, since Menu is defined below

class Menu(models.Model):
    ...  # other fields omitted
    mains = models.ManyToManyField('Main')
    drinks = models.ManyToManyField('Drinks')
    desserts = models.ManyToManyField('Dessert')
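
For reference, the "Table AB" for a plain ManyToManyField like Restaurant.menus is a table Django creates implicitly (usually named something like appname_restaurant_menus). A hypothetical explicit equivalent, just to show what needs to be populated, would be:

    class RestaurantMenu(models.Model):
        # One row per (restaurant, menu) pair -- this is the associative entity
        # that the implicit M2M table represents.
        restaurant = models.ForeignKey('Restaurant')
        menu = models.ForeignKey('Menu')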

Note: There are 10+ tables that are interrelated in a similar way as shown above

The .csv file data is as shown below, and it also needs to be checked for invalid data. For example:

no, restaurant name, restaurant description, restaurant type, restaurant homepage, restaurant country, restaurant age, restaurant awards, restaurant promotions, restaurant drinks, restaurant desserts, restaurant mains
1, Papa Delicious, This restaurant is ......, Italian, www.papadelicious.com, United States, 3, , , [Orange Juice, Apple Juice, Coffee], [Ice-cream], [Linguine, Some Burger]
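
As a rough illustration, here is a minimal parsing/validation sketch (assuming the list-valued columns are quoted in the real file so the embedded commas survive CSV parsing; the column names and the validation rule are just placeholders):

    import csv

    def parse_list_cell(cell):
        # Turn a cell like "[Orange Juice, Apple Juice]" into a list of names.
        cell = cell.strip()
        if not cell:
            return []
        return [item.strip() for item in cell.strip("[]").split(",") if item.strip()]

    def load_rows(path):
        rows, errors = [], []
        with open(path, newline="") as f:
            reader = csv.DictReader(f, skipinitialspace=True)
            for lineno, row in enumerate(reader, start=2):
                if not row["restaurant name"].strip():
                    errors.append((lineno, "missing restaurant name"))
                    continue
                for col in ("restaurant drinks", "restaurant desserts", "restaurant mains"):
                    row[col] = parse_list_cell(row[col])
                rows.append(row)
        return rows, errors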

While inserting into the database, any repeated dishes or drinks should not be inserted as new rows; instead, the existing keys should be reused in the respective associative entity(ies).

This data will be inserted a couple of times a week, possibly by multiple users at the same time, and each row should take no longer than 1 minute to process, since the error checking will already have taken some time. The hundreds of rows (approximately 600+) are a one-off initial insertion, though.

I'm looking for an efficient way to insert the data into all the appropriate tables. I've looked into COPY, but I don't really know how it could be used to load into multiple tables at once while mapping the foreign keys correctly.

I've also tried writing my own .py script to insert the data, but row-by-row insertion can be highly inefficient, especially when I'm inserting into multiple tables and mapping foreign keys in associative entities.

I guess writing a script to do the checking and insertion is the best solution to this problem, but doing the checks and all the foreign-key mapping row by row can be highly inefficient when dealing with lots of data and 10+ entities.

Are there any better solutions out there?

Arial
  • I'd request you to be more specific on your problem, or otherwise please refer to the complete guide. – Himanshu Mishra Jul 12 '15 at 16:23
  • Speed-wise the key point here is Django models, not row by row. I've dumped massive amounts of data into Django model tables using generated SQL inserts - it's fast. Re integrity, can you read & insert the top-level, no-FK tables first and then move to tables that now have their FK needs met? I.e. Drinks first, then Menu, then Restaurants? – JL Peyret Jul 12 '15 at 17:41
  • Thanks for the short answer @JLPeyret. Not too sure if I'm getting what you meant, but are you referring to the bulk_create() provided by Django 1.4? Referring to the documentation, it doesn't work with M2M relationships, does it? Or are you suggesting generating multiple SQL INSERT statements in a .py file, concatenating them with semicolons, and then submitting the statements in one shot? – Arial Jul 13 '15 at 09:24
  • Could you be more specific about goals? Some possible metrics on an import procedure: execution speed, coding effort, robustness and repeatability. *Hundreds of rows* hardly justifies a lot of worrying about inefficient coding's impact on execution time as long as you get your data in. Can you clarify the following? Is this a one-time-only import? If you have to repeat it, how frequently do you have to do that? Total expected row volume? Max allowable import time - 5 minutes? 1 hr? overnight? How clean is the data? Can you paste a row of CSV with an FK sample too? – JL Peyret Jul 13 '15 at 17:43
  • For example, removing FKs by hand and then adding them back seems very un-repeatable and un-robust. – JL Peyret Jul 13 '15 at 17:44
  • Yeah, removing FKs by hand and then adding them back is very unrepeatable. I'm thinking of creating a huge insertion query and returning a set of inserted PKs to do the associative-entity key updates. @JLPeyret I've updated the question. – Arial Jul 15 '15 at 05:55

2 Answers


PostgreSQL has a guide on how to best populate a database initially, and they suggest using the COPY command for bulk loading rows. The guide has some other good tips on how to speed up the process, like removing indexes and foreign keys before loading the data (and adding them back afterwards).
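
As a rough sketch of that approach (table, column and file names here are hypothetical; Django's real table names are usually <app>_<model> and <app>_<model>_<field>), you could COPY the pre-cleaned CSV data into staging tables and then fill the real tables and associative tables with INSERT ... SELECT, resolving the foreign keys by joining on names:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection string
    cur = conn.cursor()

    # 1. Bulk-load pre-cleaned CSVs into staging tables with COPY (one fast pass each).
    #    restaurants_clean.csv has one row per restaurant; restaurant_drinks.csv has one
    #    (restaurant_name, drink_name) row per item of the bracketed list columns.
    cur.execute("""
        CREATE TEMP TABLE staging_restaurants (
            no int, name text, description text, type text, homepage text,
            country text, age int, awards text, promotion text);
        CREATE TEMP TABLE staging_restaurant_drinks (
            restaurant_name text, drink_name text);
    """)
    with open("restaurants_clean.csv") as f:
        cur.copy_expert("COPY staging_restaurants FROM STDIN WITH (FORMAT csv, HEADER true)", f)
    with open("restaurant_drinks.csv") as f:
        cur.copy_expert("COPY staging_restaurant_drinks FROM STDIN WITH (FORMAT csv, HEADER true)", f)

    # 2. Fill the main table from staging, skipping restaurants that already exist.
    cur.execute("""
        INSERT INTO myapp_restaurant (name, description, age)
        SELECT s.name, s.description, s.age
        FROM staging_restaurants s
        WHERE NOT EXISTS (SELECT 1 FROM myapp_restaurant r WHERE r.name = s.name)
    """)

    # 3. Fill an associative table by joining back on names to resolve both keys.
    cur.execute("""
        INSERT INTO myapp_restaurant_drinks (restaurant_id, drinks_id)
        SELECT r.id, d.id
        FROM staging_restaurant_drinks sd
        JOIN myapp_restaurant r ON r.name = sd.restaurant_name
        JOIN myapp_drinks d ON d.name = sd.drink_name
    """)

    conn.commit()

The same INSERT ... SELECT pattern repeats for the other lookup and associative tables.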

Also, this is an elaborated answer for a similar situation. Please check it out.

Himanshu Mishra
  • Good links right there, thank you! Removing foreign keys and then adding them back seems like a plausible solution, but doesn't that involve checking every table with foreign keys afterwards and then doing the insertion appropriately? That would involve cross-checking against the original data and then updating the foreign keys appropriately, wouldn't it? – Arial Jul 12 '15 at 16:58
  • I guess yeah. You can give it a try and update the results. – Himanshu Mishra Jul 12 '15 at 17:00

Figure out your speed requirements AFTER you get your code working. I doubt that you will be inserting rows any slower than 10/second, so 600 rows is roughly a minute of work, 2-3 minutes at the very most.

  1. Figure out the dependencies between your tables. Stand-alone tables (without FKs) go first.

  2. For each table:

    ex: Restaurant (needs to go after Menus and Promotions)
    
    rest = Restaurant()  # or select-for-update if it exists already
    
    - map CSV cell data to the direct model field names:
    
        rest.name = row["restaurant name"]
        rest.age = row["restaurant age"]
    
    - for an FK field, get the corresponding model by lookup (on name/label, I assume)
      and assign it to the model field:
    
        rest.promotion = Promotion.objects.get(name=row["restaurant promotions"])
    
        You need to catch missing lookups, report on them and proceed
        to the next row.
    
        One flaw I foresee is that everything you have seems centred around
        'identification by name', e.g. Drinks.objects.get(name="Orange Juice").
        What if a label changes or is inconsistent? What if a restaurant name changes?
        Or is the first "no=1" column in the restaurant CSV meant to serve as a consistent key?
    
    - save the model:
    
        rest.save()
    
    - for an M2M field, look up each related model by name and create the
      association rows (see the consolidated sketch below):
    
        for menu_name in row["menus"]:
            menumtm = MenuMTM()          # the associative (through) model
            menumtm.menu = Menu.objects.get(name=menu_name)
            menumtm.restaurant = rest
            menumtm.save()
    
    If you DO need to optimize, you can pre-load commonly referenced tables
    so that you can skip the "costly" Model.objects.get(name=name) calls. Probably
    not worth the trouble.
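
As a consolidated sketch of the above (assuming the models from the question, that Restaurant, Promotion and Menu each have a name field, and a plain M2M so .add() writes the associative rows; load_restaurant_row and the parsed row keys are made-up names):

    from django.db import transaction
    from myapp.models import Restaurant, Promotion, Menu  # hypothetical app path

    class RowError(Exception):
        """Raised when a lookup for a CSV row cannot be resolved."""

    def load_restaurant_row(row):
        # FK lookup first: resolve by name, skip the row if it fails.
        promotion = None
        if row["restaurant promotions"]:
            try:
                promotion = Promotion.objects.get(name=row["restaurant promotions"])
            except Promotion.DoesNotExist:
                raise RowError("unknown promotion %r" % row["restaurant promotions"])

        # Reuse an existing restaurant instead of inserting a duplicate.
        rest, _ = Restaurant.objects.get_or_create(name=row["restaurant name"])
        rest.age = row["restaurant age"]
        rest.promotion = promotion
        rest.save()

        # M2M field: resolve each name and let Django create the associative rows.
        for menu_name in row["menus"]:
            menu, _ = Menu.objects.get_or_create(name=menu_name)
            rest.menus.add(menu)  # .add() does not duplicate an existing link

    def import_rows(parsed_rows):
        errors = []
        with transaction.atomic():  # one transaction for the whole batch is much faster
            for lineno, row in parsed_rows:
                try:
                    load_restaurant_row(row)
                except RowError as exc:
                    errors.append((lineno, str(exc)))
        return errors

Pre-loading {name: obj} dicts for Promotion and Menu up front would replace the per-row .get() calls if that ever becomes a bottleneck.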
    
JL Peyret