
I have a file with more than 50 million lines in it. Each line starts with a specific two-character code. Sample file rows are:
AB1357 0000 -9999 XFAB ...
AB1358 0000 -9999 XABC ...
BC3233 1322 -8638 SCDR ...
As you can see, the first two characters of each line are a code. I have to apply some processing to each row based on the "code" that the line has. Right now I am processing the file line by line, which is taking a lot of time. Is there any way I can optimize this? I am using Python.
Note: I already have the list of all 60 possible codes.

  • Since you obviously want to process every line, I am not sure what your intention with the question is. You need to process every line in order to process every line – FlyingTeller Jan 22 '18 at 12:44
  • Are your lines all equal size-wise? Anyway, if you need all the info you have to read the file fully... – Jean-François Fabre Jan 22 '18 at 12:45
  • Since I am writing all lines with the same code into a specific file after processing, I wanted to know if there is any way I can group lines with the same code and process them at once? – KUNAL SHARMA Jan 22 '18 at 12:54
  • Line size depends on the code, and I need all the info from the line – KUNAL SHARMA Jan 22 '18 at 13:00
  • This may be an X-Y problem. It sounds like the code is slow, not the reading of the file. – James Jan 22 '18 at 13:01
  • @KUNALSHARMA Did either of the 2 answers below help? Feel free to select one so others can view a tested solution. – jpp Jan 23 '18 at 19:47

2 Answers


One typical workflow for this kind of problem is to "lazy load" the file using the blaze framework (or dask.dataframe), then sequentially: slice according to each code, load the slice into memory, perform your operations, and export the results.

This assumes that each slice can fit in memory.

If your input file is in csv format, you can do something like this:

import dask.dataframe as dd

df = dd.read_csv('InputFile.csv', header=None, names=['Id', 'Col1', 'Col2', 'Col3'])

codes = ['AB', 'AC', 'AD']

for code in codes:
    df_slice = df[df['Id'].str.startswith(code)]

    # bring slice in memory
    df_slice_pandas = df_slice.compute()

    # perform your calculations here

    # export slice + results to file
    df_slice_pandas.to_csv('OutputFile_'+code+'.csv', index=False)
jpp

First I thought that you required a way to read a large file line by line. I stumbled upon a few similar posts to yours.

You might be limited by your hardware.

If you do not require all lines to be processed at once, perhaps you can implement a fast string pattern recognition/search algorithm that locates the two-character code of interest, since you already have a list of them.
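
A minimal sketch of that idea, assuming you just want to route each line to a per-code output file in a single pass (the input path, code list and output file names here are made up):

# Hypothetical input path and code list -- replace with your real ones
INPUT_PATH = 'InputFile.txt'
CODES = ['AB', 'BC', 'CD']  # ... your 60 known codes

# Open one output file per code up front, so each line costs a single
# dict lookup instead of up to 60 string comparisons
handles = {code: open('OutputFile_' + code + '.txt', 'w') for code in CODES}

with open(INPUT_PATH) as infile:
    for line in infile:
        out = handles.get(line[:2])   # first two characters are the code
        if out is not None:
            out.write(line)           # or apply your per-code processing here

for handle in handles.values():
    handle.close()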

This dude, Aaron, bypasses the "reading line by line" part and loads the file into RAM.
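
A rough sketch of that approach (not Aaron's exact code; it only works if the whole file fits in RAM, which at 50+ million lines may be several gigabytes):

# Read the entire file in one call, then iterate over the lines in memory
with open('InputFile.txt', 'rb') as f:
    data = f.read()

for raw in data.splitlines():
    line = raw.decode('ascii')
    code = line[:2]
    # ... process the line based on its code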

You could try creating chunks of the large file and then use Python's multithreading library. Or try a Python dictionary.
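
If the per-line work is CPU-bound, here is a rough sketch of the chunking idea. Note that it uses the multiprocessing module rather than threads, since threads will not speed up pure-Python parsing because of the GIL; the file name, chunk size and the body of process_chunk are placeholders:

from itertools import islice
from multiprocessing import Pool

CHUNK_SIZE = 100000  # lines per chunk; tune for your machine

def read_chunks(path, size):
    # Yield lists of `size` lines so the whole file never sits in memory at once
    with open(path) as f:
        while True:
            chunk = list(islice(f, size))
            if not chunk:
                break
            yield chunk

def process_chunk(lines):
    # Placeholder: group the chunk's lines by their two-character code
    grouped = {}
    for line in lines:
        grouped.setdefault(line[:2], []).append(line)
    # ... apply your real per-code processing here and return a summary
    return {code: len(rows) for code, rows in grouped.items()}

if __name__ == '__main__':
    with Pool() as pool:
        for summary in pool.imap_unordered(process_chunk, read_chunks('InputFile.txt', CHUNK_SIZE)):
            print(summary)  # e.g. merge or report the per-code counts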

Hit that Google button. All cred to the original authors.

student23