9

I am trying to read a 1.2GB CSV file, which contains 25K records, each consisting of an ID and a large string.

However, at around 10K rows, I get this error:

pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

This seems weird, since the VM has 140GB of RAM and at 10K rows the memory usage is only about 1%.

This is the command I use:

pd.read_csv('file.csv', header=None, names=['id', 'text', 'code'])

I also ran the following dummy program, which successfully filled my memory to close to 100%.

strings = []
strings.append("hello")
while True:
    # keep appending ever-longer strings until memory fills up
    strings.append("hello" + strings[-1])
David Frank

5 Answers

12

This sounds like a job for chunksize. It reads the input in chunks of the given number of rows, reducing the memory needed to parse the file.

df = pd.DataFrame()
for chunk in pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000):
    df = pd.concat([df, chunk], ignore_index=True)
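If the repeated pd.concat inside the loop is still too slow or memory-hungry (it copies the accumulated frame on every iteration), a common variant is to collect the chunks in a list and concatenate once at the end. A minimal sketch, assuming the same file and column names as above:

import pandas as pd

# Read 1000 rows at a time; only the chunks plus the final DataFrame
# need to be held in memory, and the copy happens just once.
chunks = []
for chunk in pd.read_csv('Check1_900.csv', header=None,
                         names=['id', 'text', 'code'], chunksize=1000):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)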
kilojoules
  • 2
    I would do it in a loop concatenating each chunk to the resulting DF: `df = pd.DataFrame(); for x in pd.read_csv(...): df = pd.concat([df, x], ignore_index=True)` - so we won't need RAM for all chunks plus for the resulting DF – MaxU Nov 06 '16 at 20:58
  • Wow, nice :) Thanks, it works perfectly. Do you know why the original approach fails? – David Frank Nov 06 '16 at 20:59
  • @DavidFrank You originally did not split the file into chunks, so too much memory was needed to read it all at once. Reading it in smaller chunks was doable within your memory constraints. – kilojoules Nov 06 '16 at 21:04
  • 1
    @kilojoules But I have more than 100 times as much memory as required by the file, where did the overhead come from? – David Frank Nov 06 '16 at 21:06
  • @DavidFrank, what is your pandas version? – MaxU Nov 06 '16 at 21:08
  • @DavidFrank My broad understanding is that python's way of reading the file is memory-intensive. Even representing integer values is surprisingly memory intensive in python. – kilojoules Nov 06 '16 at 21:10
  • This is like running away from the problem; something is wrong if reading a 1.2GB file into RAM takes 140GB. – Jemshit Iskenderov Jul 19 '18 at 07:05
  • @JemshitIskenderov It does not say it takes 140GB, only that 140GB is the theoretically available memory. – MattSom May 15 '20 at 15:09
2

This error can also be produced by an invalid CSV file, rather than by actually running out of memory.

I got this error with a file that was much smaller than my available RAM and it turned out that there was an opening double quote on one line without a closing double quote.

In this case, you can check the data, or you can change the quoting behavior of the parser, for example by passing quoting=3 to pd.read_csv.
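For example, a minimal sketch using the file and column names from the question (quoting=3 is csv.QUOTE_NONE, which tells the parser to treat quote characters as ordinary text, so an unbalanced double quote can no longer swallow the rest of the file into one huge field):

import csv
import pandas as pd

# quoting=csv.QUOTE_NONE (i.e. quoting=3) disables quote handling entirely
df = pd.read_csv('file.csv', header=None, names=['id', 'text', 'code'],
                 quoting=csv.QUOTE_NONE)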

Stephen Rauch
2

This is weird.

Actually, I ran into the same situation with:

df_train = pd.read_csv('./train_set.csv')

But after I tried a lot of things to solve this error, it worked. Like this:

# pd.np is deprecated (and removed in recent pandas), so plain dtype
# specifications are used here instead.
dtypes = {'id': 'int8',
          'article': str,
          'word_seg': str,
          'class': 'int8'}
df_train = pd.read_csv('./train_set.csv', dtype=dtypes)
df_test = pd.read_csv('./test_set.csv', dtype=dtypes)

Or this:

ChunkSize = 10000
i = 1
for chunk in pd.read_csv('./train_set.csv', chunksize=ChunkSize):  # read and merge in chunks
    df_train = chunk if i == 1 else pd.concat([df_train, chunk])
    print('-->Read Chunk...', i)
    i += 1

BUT!!!!! Suddenly, the original version works fine as well!

It feels like I did some useless work, and I still have no idea what really went wrong.

I don't know what to say.

Vel
  • I also faced similarly frustrating inconsistency. However, since it's a memory error, it makes sense that it can be inconsistent about when it occurs. E.g. an outside process you're not aware of may be running and eating up memory, or maybe the garbage collector decided to collect during the successful runs. I still believe it's best to go with the safer approach and use one of the methods you found that reduce memory consumption, to avoid future errors. So I don't think your work was wasted. – Kt Mack Jan 09 '19 at 14:45
1

You can use df.info(memory_usage="deep") to find out the memory usage of the data loaded into the DataFrame.
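For example, assuming a DataFrame df has already been loaded:

import pandas as pd

df = pd.read_csv('file.csv')            # the file from the question
df.info(memory_usage="deep")            # "deep" also counts the strings inside object columns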

A few things to reduce memory (a combined sketch follows this list):

  1. Only load the columns you need for processing, via the usecols parameter.
  2. Set dtypes for those columns.
  3. If the dtype of some columns is object/string, try dtype="category". In my experience it reduced memory usage drastically.
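A minimal sketch combining the three points above, with hypothetical column names and dtypes (adjust them to your own file):

import pandas as pd

# Hypothetical columns and dtypes -- replace with the ones in your CSV.
cols = ['id', 'text', 'code']
dtypes = {'id': 'int32', 'text': 'category', 'code': 'category'}

df = pd.read_csv('file.csv', usecols=cols, dtype=dtypes)
df.info(memory_usage="deep")   # verify how much memory was actually saved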
Konark Modi
0

I used the code below to load the CSV in chunks, deleting each intermediate chunk as I go to manage memory, and printing the percentage loaded in real time. Note: 96817414 is the number of rows in my CSV.

import pandas as pd
import gc

cols = ['col_name_1', 'col_name_2', 'col_name_3']
df = pd.DataFrame()
i = 0
for chunk in pd.read_csv('file.csv', chunksize=100000, usecols=cols):
    df = pd.concat([df, chunk], ignore_index=True)
    del chunk; gc.collect()  # free the intermediate chunk immediately
    i += 1
    if i % 5 == 0:
        print("% of read completed", 100 * (i * 100000 / 96817414))