
I'm using Python 3.8 (64-bit) with PyCharm on Windows 8. I got a memory error when trying to make a DataFrame from a list.

What I'm trying to do is read a huge .csv file (25 GB) into a list using the csv package, make a DataFrame from it with pd.DataFrame, and then export a .dta file with the to_stata function. My machine has 64 GB of RAM, far more than the size of the data.

Here is the error message:

MemoryError: Unable to allocate 25.8 GiB for an array with shape (77058858, 45) and data type object

I found three similar questions, but none of their solutions work for me.

  1. In the first question, the solution does not apply to me because I am already using 64-bit Python.
  2. The answer to the second question suggests that the memory error is raised because the PC does not have enough memory, but I'm fairly sure I have enough RAM for this data.
  3. In the third one, the author gets a memory error trying to read a huge CSV, and the solution is to read the data in chunks. I understand that I could do the same, but I wonder whether there is a cleaner way to solve this problem.

Here is my code:

import csv
import itertools
import pandas as pd

colname= ["id","attachmentPath",...(20 other column names),"eventid"]

reader = csv.reader(open(r'test.csv', encoding = "ISO-8859-1"), quotechar='"',delimiter=',', skipinitialspace=False, escapechar='\\')

# read full sample
records = []
for record in itertools.islice(reader,1,77058860): # 77058860 is the length of the csv
    records.append(record)

df = pd.DataFrame(records, columns=colname)

statapath = r'stata_output.dta'
df.to_stata(statapath, version=117, write_index=False)
Phil
    can you not read directly to dataframe with `pd.read_csv()` – Smurphy0000 Aug 16 '20 at 12:56
  • Because I also want to read a smaller random sample from it, so I will need the loop as defined in `for record in itertools.islice(reader,1,77058860)` – Phil Aug 16 '20 at 13:16
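
A rough illustration of the approach from the comments (reading straight into a DataFrame with `pd.read_csv()` and drawing the random sample through its skiprows callable) is below. This is only a sketch: sample_size is a placeholder, the column list is abbreviated, and the row count is taken from the question.

import random
import pandas as pd

colname = ["id", "attachmentPath", "eventid"]  # abbreviated; use the full column list

n_rows = 77058859        # data rows after the header, per the question
sample_size = 1_000_000  # hypothetical sample size

# 1-based line numbers (after the header) to keep in the random sample
keep = set(random.sample(range(1, n_rows + 1), sample_size))

df_sample = pd.read_csv(
    r'test.csv',
    encoding="ISO-8859-1",
    quotechar='"',
    escapechar='\\',
    header=None,
    names=colname,
    # skip the header line and every data line that was not sampled
    skiprows=lambda i: i == 0 or i not in keep,
)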

1 Answer


I think your data set is too big for the amount of RAM you have. In a 2017 blog post, Wes McKinney, the creator of pandas, noted that:

To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM. [emphasis in McKinney's original document]

Source: https://wesmckinney.com/blog/apache-arrow-pandas-internals/

You would probably need to process the data set in chunks. Here are a couple of ways that may reduce memory requirements, depending on the data set (a rough sketch follows below):

  • convert low-cardinality data to Categorical
  • use the smallest suitable numeric types (e.g., downcast int64 to int8)

More info here: https://pandas.pydata.org/docs/user_guide/scale.html
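
For illustration, chunked reading combined with those two ideas might look roughly like the following. This is only a sketch: the chunk size is a placeholder, and which columns are numeric or low-cardinality ("id" and "eventid" here) is an assumption to adapt to the real data.

import pandas as pd

colname = ["id", "attachmentPath", "eventid"]  # abbreviated; use the full column list

chunks = []
for chunk in pd.read_csv(r'test.csv', encoding="ISO-8859-1", quotechar='"',
                         escapechar='\\', header=None, names=colname, skiprows=1,
                         chunksize=1_000_000):  # rows per chunk; tune to available RAM
    # shrink numeric columns while each chunk is still small (assumes "id" is numeric)
    chunk["id"] = pd.to_numeric(chunk["id"], downcast="integer")
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)

# convert repeated strings to Categorical once (assumes "eventid" has few distinct values)
df["eventid"] = df["eventid"].astype("category")

df.to_stata(r'stata_output.dta', version=117, write_index=False)

Whether the concatenated frame fits in memory still depends on the final dtypes, so downcasting as much as possible before pd.concat matters more than the chunk size itself.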

jsmart
  • I see. I checked my memory usage and it's near 95%. I guess it's also because I read the CSV into a list first (the list named records), which takes more than 25 GB of memory. Anyway, I will do this in chunks. Thank you for the explanation – Phil Aug 17 '20 at 02:40