
I'm looking for the most efficient way to bulk-insert several million tuples into a database. I'm using Python, PostgreSQL and psycopg2.

I have created a long list of tuples that should be inserted into the database, sometimes with modifiers like a geometric Simplify.

The naive way to do it would be string-formatting a list of INSERT statements, but there are three other methods I've read about:

  1. Using pyformat binding style for parametric insertion
  2. Using executemany on the list of tuples, and
  3. Writing the results to a file and using COPY.

It seems that the first way is the most efficient, but I would appreciate your insights and code snippets telling me how to do it right.

Adam Matan

9 Answers


Yeah, I would vote for COPY, provided you can write a file to the server's hard drive (not the drive the app is running on), as COPY will only read off the server.
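
A minimal sketch of the client-side variant discussed in the comments below, using psycopg2's cursor.copy_from, which streams COPY data from any Python file-like object so no server-side file is needed (the table and column names here are made up):

import io
import psycopg2

conn = psycopg2.connect("dbname=mydatabase")  # connection parameters assumed
cur = conn.cursor()

# Build a tab-separated, newline-terminated buffer in memory.
buf = io.StringIO()
for row_id, name in [(1, "alpha"), (2, "beta")]:
    buf.write(f"{row_id}\t{name}\n")
buf.seek(0)

# Runs COPY mytable (id, name) FROM STDIN under the hood.
cur.copy_from(buf, "mytable", columns=("id", "name"))
conn.commit()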

Andy Shellam
  • Using psycopg2's cursor.copy_from, the file is handled by the client. It doesn't even need to be a file-system file: any Python file-like object works fine. Check http://initd.org/psycopg/docs/cursor.html#cursor.copy_from – piro Feb 16 '10 at 10:24
  • In that case it would be interesting to see how it's actually inserted into the database - I was under the impression PostgreSQL's COPY only read from the server's local file system and there was no way to bulk copy a file across using the client. – Andy Shellam Feb 16 '10 at 10:28
  • It can also read from `STDIN`, which means the data comes from the client application. See the copy command docs: http://www.postgresql.org/docs/8.4/static/sql-copy.html – piro Feb 16 '10 at 13:59
  • Using `copy` server-side is hard to beat for speed. – Avery Payne Feb 16 '10 at 14:21
  • As detailed in [this great comparison by Haki Benita](https://hakibenita.com/fast-load-data-python-postgresql), the main drawback of using `copy` is the lack of extensive support for complex data types (a better option is using psycopg's `execute_batch`/`execute_values` if you have such a requirement...) – Bluu Dec 18 '20 at 15:22

There is a new psycopg2 manual containing examples for all the options.

The COPY option is the most efficient, then executemany, then execute with pyformat.

piro

In my experience, executemany is not any faster than running many inserts yourself; the fastest way is to format a single INSERT with many values yourself. Maybe executemany will improve in the future, but for now it is quite slow.

I subclass list and overload the append method, so when the list reaches a certain size I format and run the INSERT.
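
A rough sketch of that single multi-row INSERT idea (without the list subclass), against a hypothetical mytable(id, name); cur.mogrify lets psycopg2 do the escaping before the values are spliced into one statement:

import psycopg2

conn = psycopg2.connect("dbname=mydatabase")  # connection parameters assumed
cur = conn.cursor()

rows = [(1, "alpha"), (2, "beta"), (3, "gamma")]  # in practice, flush in chunks of a few thousand

# Escape each tuple, then join them into a single multi-row VALUES list.
values = ",".join(cur.mogrify("(%s, %s)", row).decode("utf-8") for row in rows)
cur.execute("INSERT INTO mytable (id, name) VALUES " + values)
conn.commit()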

FlashDD
  • Upvoted: your experience is verified in http://stackoverflow.com/questions/8134602/psycopg2-insert-multiple-rows-with-one-query – Brian B Apr 11 '14 at 12:09

You could use a new upsert library:

$ pip install upsert

(you may have to pip install decorator first)

import psycopg2
from upsert import Upsert  # import path assumed from the package name

conn = psycopg2.connect('dbname=mydatabase')
cur = conn.cursor()
upsert = Upsert(cur, 'mytable')
for (selector, setter) in myrecords:
    upsert.row(selector, setter)

where selector is a dict like {'name': 'Chris Smith'} and setter is a dict like {'age': 28, 'state': 'WI'}.

It's almost as fast as writing custom INSERT[/UPDATE] code and running it directly with psycopg2... and it won't blow up if the row already exists.

Seamus Abshere
  • Nowadays, you should use the internal PostgreSQL implementation (requires version 9.5+), which consists of using `INSERT [...] ON CONFLICT [...]` https://www.postgresql.org/docs/9.5/static/sql-insert.html – Rmatt Jan 13 '17 at 15:30
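
A minimal sketch of that native upsert, assuming a hypothetical mytable with a unique constraint on name (column names borrowed from the example above):

import psycopg2

conn = psycopg2.connect("dbname=mydatabase")  # connection parameters assumed
cur = conn.cursor()

# PostgreSQL 9.5+: insert, or update the existing row on a name collision.
cur.execute("""
    INSERT INTO mytable (name, age, state)
    VALUES (%s, %s, %s)
    ON CONFLICT (name) DO UPDATE
    SET age = EXCLUDED.age, state = EXCLUDED.state
""", ("Chris Smith", 28, "WI"))
conn.commit()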

Anyone using SQLAlchemy could try version 1.2, which added bulk-insert support: it uses psycopg2.extras.execute_batch() instead of executemany() when you initialize your engine with use_batch_mode=True, like:

engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    use_batch_mode=True)

http://docs.sqlalchemy.org/en/latest/changelog/migration_12.html#change-4109

That way, anyone who has to use SQLAlchemy won't need to bother mixing SQLAlchemy, raw psycopg2 and direct SQL together.
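
As a rough end-to-end sketch (the table definition and row data are made up), a bulk insert through SQLAlchemy Core then goes through psycopg2's execute_batch() automatically:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

# use_batch_mode is the SQLAlchemy 1.2/1.3 spelling of this option.
engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    use_batch_mode=True)

metadata = MetaData()
mytable = Table("mytable", metadata,            # hypothetical table
                Column("id", Integer, primary_key=True),
                Column("name", String))
metadata.create_all(engine)

rows = [{"id": i, "name": "name %d" % i} for i in range(10000)]
with engine.begin() as conn:
    conn.execute(mytable.insert(), rows)  # executemany path, batched by execute_batch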

user2189731
  • How does `execute_batch()` compare with `cursor.copy_from(memory based file object)`? – Michael Goldshteyn Jul 26 '18 at 21:50
  • Here's [a great post by Haki Benita](https://hakibenita.com/fast-load-data-python-postgresql) who compares timing and memory usage on all the possible methods. His conclusion answers your question directly: copy (from memory) performs the fastest with the smallest memory footprint as well BUT has limited support for data type conversions. So it's still "better" to use the "native" methods on psycopg (i.e `execute_batch()`) if the dataset isn't too big OR if it contains complex data types. – Bluu Dec 18 '20 at 14:01

The first and the second would be used together, not separately. The third would be the most efficient server-wise though, since the server would do all the hard work.
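
A minimal sketch of combining the first two, pyformat placeholders handed to executemany (table and column names are hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=mydatabase")  # connection parameters assumed
cur = conn.cursor()

records = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

# pyformat (named) placeholders; executemany runs the statement once per dict.
cur.executemany(
    "INSERT INTO mytable (id, name) VALUES (%(id)s, %(name)s)",
    records)
conn.commit()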

Ignacio Vazquez-Abrams
  • Can you link to some code samples? I can't find any good psycopg2 resources on the web. – Adam Matan Feb 16 '10 at 10:08
  • Psycopg has a new manual: there's plenty of examples on it now. http://initd.org/psycopg/ – piro Feb 16 '10 at 10:20
  • Great, thanks. Found another good example here: http://www.devx.com/opensource/Article/29071/0/page/3, probably the best hands-on resource on psycopg2 there is. – Adam Matan Feb 16 '10 at 15:59

After some testing, unnest often seems to be an extremely fast option, as I learned from @Clodoaldo Neto's answer to a similar question.

data = [(1, 100), (2, 200), ...]  # list of tuples

cur.execute("""CREATE TABLE table1 AS
               SELECT u.id, u.var1
               FROM unnest(%s) u(id INT, var1 INT)""", (data,))

However, it can be tricky with extremely large data.

n1000

A very related question: Bulk insert with SQLAlchemy ORM


All roads lead to Rome, but some of them cross mountains or require ferries; if you want to get there quickly, just take the motorway.


In this case the motorway is to use the execute_batch() feature of psycopg2. The documentation says it best:

The current implementation of executemany() is (using an extremely charitable understatement) not particularly performing. These functions can be used to speed up the repeated execution of a statement against a set of parameters. By reducing the number of server roundtrips the performance can be orders of magnitude better than using executemany().

In my own test, execute_batch() is approximately twice as fast as executemany(), and it gives the option to configure page_size for further tweaking (if you want to squeeze the last 2-3% of performance out of the driver).

The same feature can easily be enabled if you are using SQLAlchemy: set use_batch_mode=True as a parameter when you instantiate the engine with create_engine().
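
A minimal sketch of the raw psycopg2 route (table and column names are hypothetical):

import psycopg2
from psycopg2.extras import execute_batch

conn = psycopg2.connect("dbname=mydatabase")  # connection parameters assumed
cur = conn.cursor()

rows = [(1, "alpha"), (2, "beta"), (3, "gamma")]

# Groups the parameter sets into pages and sends each page in one round trip.
execute_batch(
    cur,
    "INSERT INTO mytable (id, name) VALUES (%s, %s)",
    rows,
    page_size=1000)  # default is 100; tune to taste
conn.commit()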

chjortlund

The newest way of inserting many items is using the execute_values helper (https://www.psycopg.org/docs/extras.html#fast-execution-helpers).

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydatabase")  # connection parameters assumed
cur = conn.cursor()

insert_sql = "INSERT INTO table (id, name, created) VALUES %s"
# the template is optional; without it each tuple is adapted as a plain row
value_template = "(%s, %s, to_timestamp(%s))"

items = []
items.append((1, "name", 123123))
# append more...

execute_values(cur, insert_sql, items, value_template)
conn.commit()