
What would be the most efficient way to insert millions of records, say 50 million, from a Spark DataFrame into Postgres tables? I have done this from Spark to MSSQL in the past by making use of the bulk copy and batch size options, and that was successful.

Is there something similar that can be done here for Postgres?

Adding the code I have tried and the time it took to run the process:

from timeit import default_timer as timer
from datetime import timedelta

def inserter():
    start = timer()
    # sql_res is the DataFrame to be written
    sql_res.write.format("jdbc").option("numPartitions", "5").option("batchsize", "200000") \
        .option("url", "jdbc:postgresql://xyz.com:5435/abc_db") \
        .option("dbtable", "public.full_load").option("user", "root").option("password", "password").save()
    end = timer()
    print(timedelta(seconds=end - start))

inserter()

So I used the above approach for 10 million records, with 5 parallel connections as specified in numPartitions, and also tried a batch size of 200k.

The total time it took for the process was 0:14:05.760926 (fourteen minutes and five seconds).

Is there any other efficient approach which would reduce the time?

What would be an efficient or optimal batch size to use? Will increasing the batch size make the job quicker? Or would opening more connections (i.e. > 5) speed up the process?

On average, 14 minutes for 10 million records is not bad, but I am looking for people who have done this before to help answer this question.
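For reference, one variant I have come across (but not yet benchmarked on this data) keeps numPartitions close to the intended number of connections and enables the PostgreSQL JDBC driver's reWriteBatchedInserts option, which Spark should pass through to the driver; the partition and batch numbers below are just placeholders:

# Unverified variant: reWriteBatchedInserts rewrites batched INSERTs into
# multi-row statements on the driver side, which can cut round trips.
def inserter_batched():
    start = timer()
    sql_res.write.format("jdbc") \
        .option("numPartitions", "8") \
        .option("batchsize", "100000") \
        .option("reWriteBatchedInserts", "true") \
        .option("url", "jdbc:postgresql://xyz.com:5435/abc_db") \
        .option("dbtable", "public.full_load") \
        .option("user", "root").option("password", "password") \
        .save()
    end = timer()
    print(timedelta(seconds=end - start))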

Chetan_Vasudevan
    You could dump the data to a local CSV file first, and then use PostgreSQL's own import tools to import it - it depends on where the bottleneck is: is it slow to export from Pyspark or slow to import to Postgres, or something else? (That said, 14 minutes for 50 million rows doesn't seem that bad to me - what indexes are defined on the table?). – Dai Dec 20 '19 at 06:21
  • Dai, I have a df which is 52 mil and I am now writing it to Postgres; it's a new table I am creating through the above code. I have not created the table in Postgres first and then written to it. Is there a better possibility if I create the table first and index it in Postgres, and then send the data from the Spark df? – Chetan_Vasudevan Dec 20 '19 at 06:25
    (It's the other way around - indexes slow down insert operations on tables, but speed up select queries) – Dai Dec 20 '19 at 07:25
  • Dai, so I just create the table in Postgres without index and then try inserting and measure my performance? – Chetan_Vasudevan Dec 20 '19 at 07:32
  • Chetan, can you share your cluster configuration as well ? – Aravind Palani Dec 20 '19 at 08:49
    https://stackoverflow.com/questions/758945/whats-the-fastest-way-to-do-a-bulk-insert-into-postgres might be helpful. – Alexey Romanov Dec 20 '19 at 09:09
  • https://stackoverflow.com/a/12207237/4896540 – Alex Dec 21 '19 at 09:06

1 Answer


I actually did kind of the same work a while ago but using Apache Sqoop.

I would say that to answer this question we have to try to optimize the communication between Spark and PostgreSQL, specifically the data flowing from Spark to PostgreSQL.

But be careful, and do not forget the Spark side. It does not make sense to execute mapPartitions if the number of partitions is too high compared with the maximum number of connections PostgreSQL supports: if you have too many partitions and open a connection for each one, you will probably get the following error: org.postgresql.util.PSQLException: FATAL: sorry, too many clients already.
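As a rough sketch of what I mean (the cap of 8 parallel writers is only an illustration; check SHOW max_connections; on your instance), you can inspect and reduce the partition count before writing:

# Illustrative sketch: keep write parallelism well below max_connections.
# 'sql_res' refers to the DataFrame from the question; 8 is an assumed cap.
max_parallel_writes = 8

num_partitions = sql_res.rdd.getNumPartitions()
print("current partitions:", num_partitions)

if num_partitions > max_parallel_writes:
    # coalesce is a narrow transformation, so it avoids a full shuffle
    sql_res = sql_res.coalesce(max_parallel_writes)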

In order to tune the insertion process, I would approach the problem with the following steps:

  • Remember that the number of partitions is important. Check the number of partitions and then adjust it based on the number of parallel connections you want to have. You might want one connection per partition, so I would suggest checking coalesce, as mentioned here.
  • Check the maximum number of connections your PostgreSQL instance supports, and increase that number if you need more parallel writers.
  • For inserting data into PostgreSQL, the recommended approach is the COPY command (see the sketch after this list). Here is also a more elaborate answer about how to speed up PostgreSQL insertion.
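A minimal sketch of the COPY route, assuming psycopg2 is available on the executors and the target table already exists (host, credentials, and NULL/escaping handling are simplified placeholders):

import csv
import io
import psycopg2

def copy_partition(rows):
    # Serialize this partition to an in-memory CSV and stream it with COPY.
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
    buf.seek(0)
    conn = psycopg2.connect(host="xyz.com", port=5435, dbname="abc_db",
                            user="root", password="password")
    try:
        with conn.cursor() as cur:
            cur.copy_expert("COPY public.full_load FROM STDIN WITH CSV", buf)
        conn.commit()
    finally:
        conn.close()

# One COPY per partition; keep the partition count below max_connections.
sql_res.coalesce(8).rdd.foreachPartition(copy_partition)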

Finally, there is no silver bullet for this job. You can use all the tips I mentioned above, but it will really depend on your data and use cases.

dbustosp