I need to scan two large .txt files (each about 100 GB, 1 billion rows, several columns) and extract one column into new files. The files look like this:
ID*DATE*provider
1111*201101*1234
1234*201402*5678
3214*201003*9012
...
My Python script is:

N100 = 10000000  ## 1% of 1 billion rows, for progress reporting
with open("myFile.txt") as f:
    with open("myFile_c2.txt", "a") as f2:
        perc = 0
        for ind, line in enumerate(f):  ## <== MemoryError
            c0, c1, c2 = line.rstrip("\n").split("*")  # strip the newline so c2 is clean
            f2.write(c2 + "\n")
            if ind % N100 == 0:
                print(perc, "%")
                perc += 1
The script above runs fine for one file but gets stuck on the other at 62%. The error message says MemoryError at the line

    for ind, line in enumerate(f):

I tried several times on different servers with different amounts of RAM, and the error is always the same, always at 62%. I waited hours monitoring the RAM, and it ballooned to 28 GB (out of 32 GB total) right at the 62% mark. So my guess is that the file contains a line that is somehow far too long (maybe not terminated with \n?), and Python gets stuck trying to read it into RAM as a single line. A quick way to check this is sketched below.
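One cheap, hedged check (assuming the 62% progress figure maps roughly onto the byte offset in the file, which it should since the rows are similar lengths): seek near that point in binary mode and look at the raw bytes. The repr of a bytes object shows \r, \x00, and similar characters literally, so an odd terminator jumps out immediately.

import os

size = os.path.getsize("myFile.txt")
with open("myFile.txt", "rb") as f:
    f.seek(int(size * 0.62))        # jump to roughly where the script got stuck
    print(f.read(200))              # raw bytes: \r, \x00, etc. show up in the repr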
So my question is: before I go back to my data provider, what can I do to detect the error line and work around it or skip reading it as one huge line? I'd appreciate any suggestions!
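For the detection part, one option is to avoid line-based reading entirely. The sketch below (the file name, chunk size, and 10,000-byte threshold are all assumptions to adjust) reads the file in fixed-size binary chunks and reports the byte offsets where the gap between consecutive \n bytes becomes suspiciously large, using constant memory no matter how long the "line" is:

CHUNK = 64 * 1024 * 1024   # 64 MB per read; tune to available RAM
MAX_LINE = 10_000          # any "line" longer than this is suspect

with open("myFile.txt", "rb") as f:
    offset = 0    # absolute file offset of the current chunk's first byte
    last_nl = 0   # absolute offset of the last newline seen
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        pos = -1
        while True:
            pos = chunk.find(b"\n", pos + 1)
            if pos == -1:
                break
            nl = offset + pos
            if nl - last_nl > MAX_LINE:
                print("no \\n between byte", last_nl, "and byte", nl)
            last_nl = nl
        offset += len(chunk)
    if offset - last_nl > MAX_LINE:
        print("no \\n from byte", last_nl, "to end of file at", offset)

If the last report says "no \n ... to end of file", that confirms the rest of the file has no \n at all and points at the byte offset where the trouble starts.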
EDIT:
Starting from the 'error line', the rest of the file may all be run together with a different line separator than \n. If that's the case, can I detect the line separator and continue extracting the columns I want, rather than throwing the rest away? Thanks! A sketch of that idea follows.
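A hedged sketch of that approach (the offset, output file name, and candidate separators below are guesses, not a definitive implementation): first sample the messy region and count plausible terminators, then stream the remainder of the file, splitting on the winner manually so Python never has to hold the whole run of bytes as one line. Listing b"\r\n" before b"\r" makes the two-byte form win a tie, since a file full of \r\n contains just as many bare \r bytes.

from collections import Counter

BAD_OFFSET = 0  # byte offset where \n stopped appearing (from the scan above)
CANDIDATES = [b"\r\n", b"\r", b"\x00", b"\x1e"]  # plausible terminators

with open("myFile.txt", "rb") as f:
    f.seek(BAD_OFFSET)
    sample = f.read(1024 * 1024)    # 1 MB sample of the messy region

counts = Counter({sep: sample.count(sep) for sep in CANDIDATES})
sep, n = counts.most_common(1)[0]
print("best separator guess:", sep, "seen", n, "times in the sample")

if n > 0:
    with open("myFile.txt", "rb") as f, \
         open("myFile_c2_rest.txt", "ab") as out:
        f.seek(BAD_OFFSET)
        buf = b""
        while True:
            chunk = f.read(64 * 1024 * 1024)   # 64 MB at a time
            if not chunk:
                break
            buf += chunk
            records = buf.split(sep)
            buf = records.pop()                # keep the trailing partial record
            for rec in records:
                parts = rec.split(b"*")
                if len(parts) == 3:            # keep only well-formed rows
                    out.write(parts[2].strip() + b"\n")
        parts = buf.split(b"*")                # flush the final record
        if len(parts) == 3:
            out.write(parts[2].strip() + b"\n")

If none of the candidates shows up in the sample, there is probably no separator at all in that region, and the data provider is the right place to go.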