241

I am testing Postgres insertion performance. I have a table with a single numeric column, and there is an index on it as well. I filled the database using this query:

insert into aNumber (id) values (564),(43536),(34560) ...

I inserted 4 million rows very quickly, 10,000 at a time, with the query above. After the database reached 6 million rows, performance drastically declined to 1 million rows every 15 minutes. Is there any trick to increase insertion performance? I need optimal insertion performance on this project.

Using Windows 7 Pro on a machine with 5 GB RAM.

Erwin Brandstetter
Luke101
  • It's worth mentioning your Pg version in questions too. In this case it doesn't make tons of difference, but it does for a lot of questions. – Craig Ringer Aug 31 '12 at 12:15
  • Drop the indexes on the table and triggers, if any, and run the insert script. Once you have completed the bulk load you can recreate the indexes. – Sandeep Nov 03 '17 at 20:56

7 Answers

514

See populate a database in the PostgreSQL manual, depesz's excellent-as-usual article on the topic, and this SO question.

(Note that this answer is about bulk-loading data into an existing DB or creating a new one. If you're interested in DB restore performance with pg_restore or psql execution of pg_dump output, much of this doesn't apply, since pg_dump and pg_restore already do things like creating triggers and indexes after a schema+data restore finishes.)

There's lots to be done. The ideal solution would be to import into an UNLOGGED table without indexes, then change it to logged and add the indexes. Unfortunately in PostgreSQL 9.4 there's no support for changing tables from UNLOGGED to logged. 9.5 adds ALTER TABLE ... SET LOGGED to permit you to do this.
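
A rough sketch of that pattern on PostgreSQL 9.5+ (the table, column and file names here are hypothetical placeholders):

-- load into an unlogged table first, then make it durable and index it
CREATE UNLOGGED TABLE aNumber_staging (id bigint);
COPY aNumber_staging (id) FROM '/path/to/ids.csv' WITH (FORMAT csv);  -- or \copy from the client
ALTER TABLE aNumber_staging SET LOGGED;                               -- PostgreSQL 9.5+ only
CREATE INDEX aNumber_staging_id_idx ON aNumber_staging (id);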

If you can take your database offline for the bulk import, use pg_bulkload.

Otherwise:

  • Disable any triggers on the table

  • Drop indexes before starting the import, re-create them afterwards. (It takes much less time to build an index in one pass than it does to add the same data to it progressively, and the resulting index is much more compact).

  • If doing the import within a single transaction, it's safe to drop foreign key constraints, do the import, and re-create the constraints before committing. Do not do this if the import is split across multiple transactions as you might introduce invalid data.

  • If possible, use COPY instead of INSERTs (there is a short sketch after this list)

  • If you can't use COPY consider using multi-valued INSERTs if practical. You seem to be doing this already. Don't try to list too many values in a single VALUES though; those values have to fit in memory a couple of times over, so keep it to a few hundred per statement.

  • Batch your inserts into explicit transactions, doing hundreds of thousands or millions of inserts per transaction. There's no practical limit AFAIK, but batching will let you recover from an error by marking the start of each batch in your input data. Again, you seem to be doing this already.

  • Use synchronous_commit=off and a huge commit_delay to reduce fsync() costs. This won't help much if you've batched your work into big transactions, though.

  • INSERT or COPY in parallel from several connections. How many depends on your hardware's disk subsystem; as a rule of thumb, you want one connection per physical hard drive if using direct attached storage.

  • Set a high checkpoint_segments value and enable log_checkpoints. Look at the PostgreSQL logs and make sure it's not complaining about checkpoints occurring too frequently.

  • If and only if you don't mind losing your entire PostgreSQL cluster (your database and any others on the same cluster) to catastrophic corruption if the system crashes during the import, you can stop Pg, set fsync=off, start Pg, do your import, then (vitally) stop Pg and set fsync=on again. See WAL configuration. Do not do this if there is already any data you care about in any database on your PostgreSQL install. If you set fsync=off you can also set full_page_writes=off; again, just remember to turn it back on after your import to prevent database corruption and data loss. See non-durable settings in the Pg manual.
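
A minimal sketch combining a few of these points (dropping and re-creating the index, COPY, and per-transaction synchronous_commit); the index, table and file names are hypothetical:

DROP INDEX IF EXISTS aNumber_id_idx;          -- rebuild it in one pass afterwards

BEGIN;
SET LOCAL synchronous_commit = off;           -- async commit for this transaction only
COPY aNumber (id) FROM '/path/to/ids.csv' WITH (FORMAT csv);
COMMIT;

CREATE INDEX aNumber_id_idx ON aNumber (id);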

You should also look at tuning your system:

  • Use good quality SSDs for storage as much as possible. Good SSDs with reliable, power-protected write-back caches make commit rates incredibly faster. They're less beneficial when you follow the advice above - which reduces disk flushes / number of fsync()s - but can still be a big help. Do not use cheap SSDs without proper power-failure protection unless you don't care about keeping your data.

  • If you're using RAID 5 or RAID 6 for direct attached storage, stop now. Back your data up, restructure your RAID array to RAID 10, and try again. RAID 5/6 are hopeless for bulk write performance - though a good RAID controller with a big cache can help.

  • If you have the option of using a hardware RAID controller with a big battery-backed write-back cache this can really improve write performance for workloads with lots of commits. It doesn't help as much if you're using async commit with a commit_delay or if you're doing fewer big transactions during bulk loading.

  • If possible, store WAL (pg_xlog) on a separate disk / disk array. There's little point in using a separate filesystem on the same disk. People often choose to use a RAID1 pair for WAL. Again, this has more effect on systems with high commit rates, and it has little effect if you're using an unlogged table as the data load target.

You may also be interested in Optimise PostgreSQL for fast testing.

Ufuk Hacıoğulları
Craig Ringer
  • Would you agree that the write penalty from RAID 5/6 is somewhat mitigated if good quality SSDs are used? Obviously there is still a penalty, but I think the difference is far less painful than it is with HDDs. –  May 29 '14 at 20:09
  • I haven't tested that. I'd say it's probably less bad - the nasty write amplification effects and (for small writes) need for a read-modify-write cycle still exist, but the severe penalty for excessive seeking should be a non-issue. – Craig Ringer May 30 '14 at 00:09
  • Can we just disable indexes instead of dropping them, for example, by setting `indisvalid` (http://www.postgresql.org/docs/8.3/static/catalog-pg-index.html) to false, then load data and then bring indexes online by `REINDEX`? – Vladislav Rastrusny Dec 10 '14 at 09:35
  • @FractalizeR Unfortunately it's not that simple. I really don't recommend that approach. – Craig Ringer Dec 10 '14 at 09:49
  • @CraigRinger Can you please elaborate? Not that I am going to do that, I'm just curious. My current approach is to find (https://gist.github.com/FractalizeR/6c4fd1433611484e6c05), drop and then recreate all indexes. – Vladislav Rastrusny Dec 10 '14 at 10:40
  • @CraigRinger - A related Q, I am inserting about 30 million records. I am inserting this using INSERT into target select * from source. All this is wrapped inside a function. Is COPY likely to improve the performance? – Jayadevan Feb 04 '15 at 06:21
  • @Jayadevan `COPY` won't help you for moving data around within the DB, it's for import/export. `INSERT INTO ... SELECT ...` is as good as you are likely to get without going parallel across multiple connections with all the attendant hassle, lack of atomic commit, etc. – Craig Ringer Feb 06 '15 at 04:29
  • OK. Since many threads about improving insert performance suggested COPY, I thought maybe COPY has some optimization for bulk data processing. So I tried COPYing data from the source to a file and COPYing from the file to the target. It did not reduce the load time. I just wanted to confirm I did not miss anything. Thanks. – Jayadevan Feb 06 '15 at 06:10
  • @CraigRinger I've tested RAID-5 vs RAID-10 with SSDs on a Perc H730. RAID-5 is actually faster. Also it might be worth noting that inserts/transactions in combination with large byteas seem to be faster than copy. Overall good advice though. – atlaste Dec 01 '15 at 12:22
  • @CraigRinger I know this answer is really old, but people are reading this and thinking they can make `COPY` faster by disabling table logging, or setting on asynchronous commits. Could you please clarify in your answer what these optimizations bring to `COPY` and `INSERT`, so people don't think that this speeds up everything magically. – user157251 Jan 12 '17 at 01:23
  • Async commit won't help with `COPY`; as I said above "this won't help much if you've batched your work into big transactions, though." Using `UNLOGGED` tables _will_, though with the expected safety costs. – Craig Ringer Jan 12 '17 at 01:28
  • I asked a question here, may be interested in it http://dba.stackexchange.com/q/160807/2639 – user157251 Jan 12 '17 at 02:08
  • Is anyone seeing any major speed improvements with `UNLOGGED`? A quick test shows something like a 10-20% improvement. – serg Feb 06 '17 at 23:25
  • @serg It depends on your storage performance: whether WAL and the main tablespace are on different volumes/disks, how good your storage is at concurrent I/O, and a number of other factors. There are non-speed advantages too like reducing AWS EBS iops charges and pool depletion, reducing size of WALs archived for point-in-time recovery / streaming replication data sizes, etc. – Craig Ringer Feb 10 '17 at 08:32
  • There is a limit on really big transactions, actually: 2^32-2 commands per transaction. I hit this limit once (thanks, osmosis!). – ceteras Feb 28 '17 at 09:00
  • This answer is likely a good candidate for a topic somewhere under https://stackoverflow.com/documentation/postgresql/topics – Leonid Jul 01 '17 at 17:04
  • I'm using the Hibernate-PostgreSQL stack, but documents on the Internet say batch insert will not increase performance if there are related table insertions. How can we handle those cases? – Akhil S Kamath Oct 03 '17 at 05:22
  • "documents on the Internet". Links? – Craig Ringer Oct 03 '17 at 06:10
  • What about removing primary key constraints? These have an index associated with them, so must also result in extra work during an import? – Dan Gravell Oct 25 '17 at 12:27
  • @CraigRinger: One nitpick regarding: "Use synchronous_commit=off and a huge commit_delay". Just synchronous_commit=off is enough since "commit_delay is ignored during an asynchronous commit" Source: https://www.postgresql.org/docs/9.5/static/wal-async-commit.html – Grzegorz Luczywo Feb 17 '18 at 12:29
16

Use COPY table FROM ... WITH BINARY, which, according to the documentation, "is somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert and if you are comfortable with binary data.

Here is an example recipe in Python, using psycopg2 with binary input.
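
For example, a sketch of the server-side load (the table and file path are placeholders, and the file must already be in PostgreSQL's binary COPY format):

COPY aNumber (id) FROM '/path/to/ids.pgbin' WITH (FORMAT binary);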

Mike T
  • Binary mode can be a big time saving on some inputs, such as timestamps, where parsing them is nontrivial. For many data types it doesn't offer much benefit or can even be slightly slower due to increased bandwidth (e.g. small integers). Good point raising it. – Craig Ringer Apr 07 '15 at 14:10
13

I spent around 6 hours on the same issue today. Inserts go at a 'regular' speed (less than 3 sec per 100K) up until about 5M (out of 30M total) rows, and then the performance sinks drastically (all the way down to 1 min per 100K).

I will not list all of the things that did not work and cut straight to the meat.

I dropped the primary key on the target table (which was a GUID) and my 30M rows happily flowed to their destination at a constant speed of less than 3 sec per 100K.
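
A sketch of that approach with hypothetical table and constraint names (drop the constraint, load, then re-add it so the index is rebuilt in a single pass):

ALTER TABLE target_table DROP CONSTRAINT target_table_pkey;  -- hypothetical names
-- ... bulk insert here ...
ALTER TABLE target_table ADD PRIMARY KEY (id);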

Dennis
11

In addition to Craig Ringer's excellent post and depesz's blog post, if you would like to speed up your inserts through the ODBC (psqlodbc) interface by using prepared-statement inserts inside a transaction, there are a few extra things you need to do to make it work fast:

  1. Set the level-of-rollback-on-errors to "Transaction" by specifying Protocol=-1 in the connection string. By default psqlodbc uses "Statement" level, which creates a SAVEPOINT for each statement rather than an entire transaction, making inserts slower.
  2. Use server-side prepared statements by specifying UseServerSidePrepare=1 in the connection string. Without this option the client sends the entire insert statement along with each row being inserted.
  3. Disable auto-commit on each statement using SQLSetConnectAttr(conn, SQL_ATTR_AUTOCOMMIT, reinterpret_cast<SQLPOINTER>(SQL_AUTOCOMMIT_OFF), 0);
  4. Once all rows have been inserted, commit the transaction using SQLEndTran(SQL_HANDLE_DBC, conn, SQL_COMMIT);. There is no need to explicitly open a transaction.

Unfortunately, psqlodbc "implements" SQLBulkOperations by issuing a series of unprepared insert statements, so that to achieve the fastest insert one needs to code up the above steps manually.
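
As an illustration, the relevant psqlodbc connection-string keywords from points 1 and 2 might look like this (the driver name, host, database and credentials are placeholders):

Driver={PostgreSQL Unicode};Server=localhost;Port=5432;Database=mydb;Uid=myuser;Pwd=secret;Protocol=-1;UseServerSidePrepare=1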

Maxim Egorushkin
  • A large socket buffer size, `A8=30000000` in the connection string, should also be used to speed up inserts. – Andrus Mar 21 '20 at 08:47
7

If you happen to be inserting columns with UUIDs (which is not exactly your case) and to add to @Dennis' answer (I can't comment yet), be advised that using gen_random_uuid() (requires PG 9.4 and the pgcrypto module) is (a lot) faster than uuid_generate_v4():

=# explain analyze select uuid_generate_v4(),* from generate_series(1,10000);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.00..12.50 rows=1000 width=4) (actual time=11.674..10304.959 rows=10000 loops=1)
 Planning time: 0.157 ms
 Execution time: 13353.098 ms
(3 rows)

vs


=# explain analyze select gen_random_uuid(),* from generate_series(1,10000);
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.00..12.50 rows=1000 width=4) (actual time=252.274..418.137 rows=10000 loops=1)
 Planning time: 0.064 ms
 Execution time: 503.818 ms
(3 rows)

Also, it's the officially suggested way to do it:

Note

If you only need randomly-generated (version 4) UUIDs, consider using the gen_random_uuid() function from the pgcrypto module instead.

This dropped insert time from ~2 hours to ~10 minutes for 3.7M rows.
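
A sketch of switching an existing column default over, assuming a hypothetical table/column and that pgcrypto is available (PostgreSQL 9.4+):

CREATE EXTENSION IF NOT EXISTS pgcrypto;
ALTER TABLE my_table ALTER COLUMN id SET DEFAULT gen_random_uuid();  -- hypothetical table/column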

Community
1

For optimal insertion performance, disable the index if that's an option for you. Other than that, better hardware (disk, memory) also helps.

Icarus
-4

I encountered this insertion performance problem as well. My solution was to spawn some goroutines to do the insertion work. In the meantime, SetMaxOpenConns should be given a sensible value, otherwise you will hit too-many-open-connections errors.

// Needs "database/sql", "fmt", "sync" and a Postgres driver import such as _ "github.com/lib/pq".
db, err := sql.Open("postgres", connStr) // connStr: your connection string
if err != nil {
    fmt.Println(err)
    return
}
db.SetMaxOpenConns(maxOpenConns) // pick a limit your server can handle
var wg sync.WaitGroup
for _, query := range queries {
    wg.Add(1)
    go func(msg string) {
        defer wg.Done()
        if _, err := db.Exec(msg); err != nil {
            fmt.Println(err)
        }
    }(query)
}
wg.Wait()

Loading speed is much faster for my project. This code snippet just gives an idea of how it works; readers should be able to modify it easily.

Patrick
  • Well, you can say that. But it does reduce the running time from a few hours to several minutes for millions of rows for my case. :) – Patrick Mar 23 '19 at 17:33