
I have a data frame that I want to write to a Postgres database. This functionality needs to be part of a Flask app.

For now, I'm running this insertion step as a separate script: I create an SQLAlchemy engine and pass it to df.to_sql() to write the data frame to a database table.
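
Roughly, the standalone script does something like this (the connection string and table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# Example data frame; the real one is built elsewhere in the script
df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})

# Placeholder connection URL and table name
engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/mydb')
df.to_sql('my_table', engine, if_exists='append', index=False)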

But when I integrate this functionality into the Flask app, I already have existing connections to the Postgres database, which were created using a Psycopg2 connection pool.
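
Those existing connections come from something along these lines (the credentials and pool sizes are placeholders):

from psycopg2 import pool

# Placeholder config; the real app reads these values from its Flask config
db_pool = pool.SimpleConnectionPool(
    1, 10,
    dbname='mydb',
    user='user',
    password='password',
    host='localhost',
)

conn = db_pool.getconn()   # borrow a connection from the pool
# ... run queries with conn ...
db_pool.putconn(conn)      # return it to the pool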

Looking at the df.to_sql() documentation, it mentions that it uses an SQLAlchemy engine; I don't see any other connection mechanism. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html#pandas-dataframe-to-sql

My question is: why do I need to create an SQLAlchemy engine when I already have these existing connections? Why can't I use them?

Underoos

1 Answer


You can use those connections and avoid SQLAlchemy. This is going to sound rather unintuitive, but it will be much faster than regular inserts (even if you were to drop the ORM and issue a plain query, e.g. with `executemany`). Inserts are slow even with raw queries, but you'll see that COPY is mentioned several times in How to speed up insertion performance in PostgreSQL. In this instance, my motivations for the approach below are:

  1. Use COPY instead of INSERT
  2. Don't trust Pandas to generate the correct SQL for this operation (although, as noted by Ilja Everilä, a COPY-based path actually got added to Pandas in v0.24; see the sketch after the main example below)
  3. Don't write the data to disk to make an actual file object; keep it all in memory

Suggested approach using cursor.copy_from():

import csv
import io

import psycopg2
from flask import current_app  # needed for the current_app.config lookups below

df = "<your_df_here>"

# drop all the columns you don't want in the insert data here

# First take the headers
headers = df.columns

# Now get a nested list of values
data = df.values.tolist()

# Create an in-memory CSV file
string_buffer = io.StringIO()
csv_writer = csv.writer(string_buffer)
csv_writer.writerows(data)

# Reset the buffer back to the first line
string_buffer.seek(0)

# Open a connection to the db (which I think you already have available)
with psycopg2.connect(dbname=current_app.config['POSTGRES_DB'], 
                      user=current_app.config['POSTGRES_USER'],
                      password=current_app.config['POSTGRES_PW'], 
                      host=current_app.config['POSTGRES_URL']) as conn:
    c = conn.cursor()

    # Now upload the data as though it was a file
    c.copy_from(string_buffer, 'the_table_name', sep=',', columns=headers)
    conn.commit()

This should be orders of magnitude faster than actually doing inserts.
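
As an aside on point 2: since pandas 0.24, to_sql accepts a method callable, so the same COPY trick can be driven through to_sql itself. A rough sketch along the lines of the pattern described in the pandas I/O docs (the engine and table name are placeholders):

import csv
import io

def psql_insert_copy(table, conn, keys, data_iter):
    # conn is an SQLAlchemy connection; grab the underlying DBAPI (psycopg2) connection
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        buf = io.StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

# Usage (placeholder names):
# df.to_sql('the_table_name', engine, method=psql_insert_copy, index=False, if_exists='append')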

roganjosh
  • In relation to your 2nd point it's worth a note that you can use `COPY` with `to_sql` rather easily: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method – Ilja Everilä Nov 02 '19 at 07:24
  • @IljaEverilä Interesting. I didn't know it got added in `0.24`. Thanks – roganjosh Nov 02 '19 at 09:14
  • This answer helped me boost performance. However, I was looking for confirmation of whether I can use the existing DB connections to insert the data frame using `df.to_sql`. Thanks in advance. – Underoos Nov 02 '19 at 22:06
  • In advance of what? – roganjosh Nov 02 '19 at 23:18
  • I'm not sure how the creation of a new in-memory CSV file is going to affect performance. Is it overhead to create a new file, particularly if the data frame is small and could be inserted with minimal delay? – Underoos Nov 04 '19 at 07:39
  • @SukumarRdjf well, you'll need to determine that yourself. I can't see what else you expect me to say here; I've already shown you exactly what you asked for. – roganjosh Nov 04 '19 at 09:11