54

I am having trouble querying a table of more than 5 million records from a MS SQL Server database. I want to select all of the records, but my code seems to fail when it selects too much data into memory.

This works:

import pandas.io.sql as psql
sql = "SELECT TOP 1000000 * FROM MyTable" 
data = psql.read_frame(sql, cnxn)

...but this does not work:

sql = "SELECT TOP 2000000 * FROM MyTable" 
data = psql.read_frame(sql, cnxn)

It returns this error:

File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples
(pandas\lib.c:42733) Memory Error

I have read here that a similar problem exists when creating a dataframe from a csv file, and that the work-around is to use the 'iterator' and 'chunksize' parameters like this:

read_csv('exp4326.csv', iterator=True, chunksize=1000)

Is there a similar solution for querying from a SQL database? If not, what is the preferred workaround? Should I use some other method to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work just to execute a SELECT * query. Surely there is a simpler approach.

slizb

4 Answers

57

As mentioned in a comment, starting from pandas 0.15 you have a chunksize option in read_sql to read and process the query results chunk by chunk:

sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql, engine, chunksize=5):
    print(chunk)

Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
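
For context, here is a minimal end-to-end sketch of how this can be used, assuming a SQLAlchemy engine built from a hypothetical pyodbc connection string (driver, server, database and credentials are placeholders), and assuming each chunk is processed or collected as it arrives:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- adjust driver/server/database/credentials to your setup.
engine = create_engine(
    "mssql+pyodbc://user:password@myserver/MyDatabase"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

sql = "SELECT * FROM MyTable"
chunks = []
for chunk in pd.read_sql_query(sql, engine, chunksize=100000):
    # Filter or aggregate each chunk here so only the reduced result is kept.
    chunks.append(chunk)

full_df = pd.concat(chunks, ignore_index=True)

Note that concatenating every chunk at the end still needs enough RAM for the full result; the win comes from reducing or writing out each chunk before the next one is fetched.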

Kamil Sindi
  • **This is the way** to handle issues where RAM size < db_you_wish_to_load size – zelusp Oct 06 '16 at 13:05
  • E.g. a keyword argument skip_chunks=number_of_chunks, that can skip first few chunks without even reading them. The use case is reading end or middle of SQL table with over 1 million rows without resorting to SQL query or SQLalchemy. – denfromufa Oct 25 '16 at 02:36
  • For MySQL databases this is not the case. If your dataset won't fit in your available memory, it takes a lot of time to extract since MySQL does not have a server side cursor to work with. – MrMauricioLeite Nov 28 '16 at 12:29
  • Anyone relying on using the chunksize option should first read https://github.com/pandas-dev/pandas/issues/12265. For many databases, the entire dataset will still be read into memory whole before an iterator is returned. For some databases, setting connection options appropriately can overcome this problem; for instance with Postgres, set execution_options={'stream_results': True} when creating the engine... – Janak Mayer Dec 01 '16 at 20:20
  • this does not save memory; it pulls down the whole table and then chunks it. – eljusticiero67 Aug 08 '17 at 20:39
  • @eljusticiero67 see Janak Mayer's comment above – MichaelChirico Oct 12 '17 at 08:38
  • Check the last comment of Janak Mayer's link – akniazi Nov 20 '18 at 06:57
  • @JanakMayer do you still need to specify `chunksize` when setting `stream_results`? – CutePoison Oct 02 '20 at 10:54
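
Following up on Janak Mayer's comment above, a rough sketch of the Postgres workaround: pass execution_options={'stream_results': True} to create_engine so a server-side cursor is used and chunks are actually streamed instead of loaded whole (the connection URL and the process() function are illustrative):

import pandas as pd
from sqlalchemy import create_engine

# Illustrative Postgres URL -- replace user/password/host/database with your own.
engine = create_engine(
    "postgresql://user:password@localhost:5432/mydb",
    execution_options={"stream_results": True},  # enable server-side cursor
)

for chunk in pd.read_sql_query("SELECT * FROM my_table", engine, chunksize=100000):
    process(chunk)  # hypothetical per-chunk processing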
50

Update: Make sure to check out the answer above, as pandas now has built-in support for chunked loading (the chunksize option in read_sql).

You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:

import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # ORDER BY must come before LIMIT/OFFSET
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:  # last (partial) chunk reached
        break
full_df = pd.concat(dfs)

It might also be possible that the whole dataframe is simply too large to fit in memory; in that case, you will have no option but to restrict the number of rows or columns you're selecting.

ThePhysicist
  • Thanks, I will try this out, though I fear that memory space may indeed be my issue. Also, since I am using MS SQL Server 2008, the LIMIT and OFFSET SQL options are not available to me. Others should know to reference [here](http://stackoverflow.com/questions/187998/row-offset-in-sql-server) for the solution specific to their setup – slizb Aug 07 '13 at 17:01
  • you can also write these dfs to an HDF5 file (the question you referenced uses that; also peruse the docs on appending tables: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables), then read back sections or iterate as needed; HDF5 is much more compact than SQL for data – Jeff Aug 07 '13 at 17:21
  • for Postgres, ORDER BY comes before LIMIT: SELECT * FROM my_table ORDER BY id LIMIT %d OFFSET %d; newer pandas uses read_sql instead of read_frame. – Aseem Sep 06 '18 at 22:32
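
Since the asker is on SQL Server 2008, where LIMIT/OFFSET is not available, roughly the same chunked loop can be written with ROW_NUMBER(); this is a sketch only, keeping read_frame from the answer above and assuming an integer key column named ID:

import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # Page through the table with ROW_NUMBER(), which works on SQL Server 2008.
    # Note: the helper column rn will appear in the result; drop it if unwanted.
    sql = """
        SELECT * FROM (
            SELECT *, ROW_NUMBER() OVER (ORDER BY ID) AS rn
            FROM MyTable
        ) AS numbered
        WHERE rn > %d AND rn <= %d
    """ % (offset, offset + chunk_size)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)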
6

Code solution and remarks.

# Create empty list
dfl = []  

# Create empty dataframe
dfs = pd.DataFrame()  

# Start Chunking
for chunk in pd.read_sql(query, con=conct, chunksize=10000000):

    # Start Appending Data Chunks from SQL Result set into List
    dfl.append(chunk)

# Start appending data from list to dataframe
dfs = pd.concat(dfl, ignore_index=True)

However, my memory analysis tells me that even though the memory is released after each chunk is extracted, the list keeps growing and occupying that memory, resulting in no net gain in free RAM.

Would love to hear what the author / others have to say.

  • Saving chunks to disk instead of keeping the whole dataset, and deleting chunks via "del", should be OK. You can also change the dtypes before saving a chunk to disk to reduce memory consumption. And you may save the chunks in Parquet format, then read back only the needed columns. – Mehmet Burak Sayıcı Mar 11 '21 at 10:17
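
Building on that comment, a sketch of writing each chunk straight to disk as Parquet (one file per chunk) instead of accumulating a list; the file naming and column names are illustrative, and query/conct are the variables from the answer above:

import pandas as pd

paths = []
for i, chunk in enumerate(pd.read_sql(query, con=conct, chunksize=100000)):
    path = "chunk_%05d.parquet" % i
    chunk.to_parquet(path)  # requires pyarrow or fastparquet
    paths.append(path)
    del chunk               # release the chunk before fetching the next one

# Later, read back only the columns that are actually needed.
needed = pd.concat(pd.read_parquet(p, columns=["col_a", "col_b"]) for p in paths)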
0

If you want to limit the number of rows in output, just use:

data = psql.read_frame(sql, cnxn, chunksize=1000000).__next__()