269

I need to read a large file, line by line. Let's say the file is more than 5GB and I need to read each line, but obviously I don't want to use readlines() because it will create a very large list in memory.

How will the code below work for this case? Is xreadlines itself reading one by one into memory? Is the generator expression needed?

f = (line for line in open("log.txt").xreadlines())  # how much is loaded in memory?

f.next()  

Plus, what can I do to read this in reverse order, just like the Linux tail command?

I found:

http://code.google.com/p/pytailer/

and

"python head, tail and backward read by lines of a text file"

Both worked very well!

Mus

14 Answers

359

I provided this answer because Keith's, while succinct, doesn't close the file explicitly.

with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)
John La Rooy
  • 37
    the question still is: will "for line in infile" load my 5GB of lines into memory? And how can I read from the tail? – Bruno Rocha - rochacbruno Jun 25 '11 at 02:31
  • 80
    @rochacbruno, it only reads one line at a time. When the next line is read, the previous one will be garbage collected unless you have stored a reference to it somewhere else – John La Rooy Jun 25 '11 at 02:33
  • 1
    @rochacbruno, Reading the lines in reverse order is unfortunately not as easy to do efficiently. Generally you would want to read from the end of the file in sensibly sized chunks (kilobytes to megabytes, say) and split on newline characters (or whatever the line-ending char is on your platform); there is a sketch of this approach just after these comments. – John La Rooy Jun 25 '11 at 02:36
  • 4
    Thanks! I found the tail solution http://stackoverflow.com/questions/5896079/python-head-tail-and-backward-read-by-lines-of-a-text-file/5896210#5896210 – Bruno Rocha - rochacbruno Jun 25 '11 at 03:09
  • @gnibbler: In practice, Keith's version should close the file in CPython due to the peculiar semantics of Python's memory management. – Dietrich Epp Jun 25 '11 at 03:25
  • @Dietrich, you are correct. Most of us have taken advantage of that at some time or another, I imagine. Even in implementations such as Jython, the file is closed when it is garbage collected; it's just not deterministic when it will be collected – John La Rooy Jun 25 '11 at 03:31
  • @JohnLaRooy what if a line itself is super long? – bawejakunal Nov 18 '16 at 05:39
  • It should be noted that Python will convert line endings of `\r\n` (old DOS) or `\r` (old Mac) to simply `\n` (modern *Nix/Everything). It is a feature that I appreciate, but if you are not aware of it, you could have unexpected results. See the comments on my answer below https://stackoverflow.com/a/45623945/117471 – Bruno Bronosky Aug 11 '17 at 13:22
  • I tested it and it works fine; tried with a logfile of more than 50MB – resgef Jan 08 '18 at 16:16
  • 4
    @bawejakunal, Do you mean if a line is too long to load into memory at once? That is unusual for a _text_ file. Instead of using a `for` loop which iterates over the lines, you can use `chunk = infile.read(chunksize)` to read limited-size chunks regardless of their content. You'll have to search inside the chunks for newlines yourself. – John La Rooy Jan 09 '18 at 21:50
  • I have an issue with a huge file while loading it using Python; here is the link: https://stackoverflow.com/questions/66675216/copying-a-huge-file-using-python-scripts – dileepVikram Mar 17 '21 at 22:23
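As mentioned in the comment above, reading in reverse means seeking to the end of the file and scanning backwards in fixed-size blocks. Here is a minimal sketch of that approach (this is not pytailer's API; the function name tail_lines and the 4096-byte block size are just choices made for the example):

import os

def tail_lines(path, n=10, block_size=4096):
    """Return the last n lines of a file by scanning backwards in blocks."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b''
        # Read blocks from the end until we have enough newlines (or reach the start)
        while pos > 0 and data.count(b'\n') <= n:
            read_size = min(block_size, pos)
            pos -= read_size
            f.seek(pos)
            data = f.read(read_size) + data
        return [line.decode('utf-8', errors='replace') for line in data.splitlines()[-n:]]

print('\n'.join(tail_lines("log.txt", 10)))

Only the last few blocks are ever held in memory, which is what makes this workable for a 5GB file.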
67

All you need to do is use the file object as an iterator.

for line in open("log.txt"):
    do_something_with(line)

Even better is using a context manager in recent Python versions.

with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)

This will automatically close the file as well.

Keith
16

You are better off using an iterator instead. Relevant: http://docs.python.org/library/fileinput.html

From the docs:

import fileinput
for line in fileinput.input("filename"):
    process(line)

This will avoid copying the whole file into memory at once.

Mikola
  • Although the docs show the snippet as "typical use", using it does not call the `close()` method of the returned `FileInput` class object when the loop finishes -- so I would avoid using it this way. In Python 3.2 they've finally made `fileinput` compatible with the context manager protocol which addresses this issue (but the code still wouldn't be written quite the way shown). – martineau Jul 24 '12 at 03:50
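For reference, a quick sketch of what that comment points to: since Python 3.2, fileinput.input() can be used as a context manager, so the underlying FileInput object is closed when the block exits (process here stands in for your own line handler, as in the snippet above):

import fileinput

# FileInput supports the context manager protocol since Python 3.2,
# so close() is called automatically when the with block exits.
with fileinput.input("filename") as f:
    for line in f:
        process(line)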
16

An old school approach:

fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()
PTBNL
11

Here's what you do if you don't have newlines in the file:

with open('large_text.txt') as f:
  while True:
    c = f.read(1024)
    if not c:
      break
    print(c)
Ariel Cabib
  • While I like this method, you run the risk of having a line in your text broken into chunks. I saw this personally, which means that if you are searching for a string in the file like I was, I'd miss some because the lines they were on were broken into chunks. Is there a way to get around this? Using readlines didn't work well as I got miscounts @Ariel Cabib – edo101 Jun 03 '20 at 02:37
8

Please try this:

with open('filename','r',buffering=100000) as f:
    for line in f:
        print line
Daniel Trugman
jyoti das
  • please explain? – Nikhil VJ Mar 31 '18 at 04:00
  • 5
    From Python's official documentation: [link](https://docs.python.org/2/library/functions.html#open) The optional buffering argument specifies the file’s desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used. – jyoti das Apr 19 '18 at 05:26
  • Saved my day, in my case, with >~4gb files with two file handlers (one read, the other write) python was hanging and now it's fine! Thanks. – Xelt Apr 23 '19 at 13:37
    @jyotidas While I like this method, you run the risk of having a line in your text broken into chunks. I saw this personally, which means that if you are searching for a string in the file like I was, I'd miss some because the lines they were on were broken into chunks. Is there a way to get around this? Using readlines didn't work well as I got miscounts – edo101 Jun 03 '20 at 02:37
4

I couldn't believe that it could be as easy as @john-la-rooy's answer made it seem. So, I recreated the cp command using line by line reading and writing. It's CRAZY FAST.

#!/usr/bin/env python3.6

import sys

with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)
Bruno Bronosky
  • NOTE: Because python's `readline` standardizes line endings, this has the side effect of converting documents with DOS line endings of `\r\n` to Unix line endings of `\n`. My whole reason for searching out this topic was that I needed to convert a log file that receives a jumble of line endings (because the developer blindly used various .NET libraries). I was shocked to find that after my initial speed test, I didn't need to go back and `rstrip` the lines. It was already perfect! – Bruno Bronosky Aug 11 '17 at 13:13
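If you do need to preserve the original line endings rather than have them normalised, one option (my own addition, not something the answer above does) is to open both files with newline='' in Python 3, which disables the translation:

import sys

# newline='' turns off universal-newline translation on read and write,
# so \r\n and \r sequences are written out exactly as they were read.
with open(sys.argv[2], 'w', newline='') as outfile:
    with open(sys.argv[1], newline='') as infile:
        for line in infile:
            outfile.write(line)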
3

The blaze project has come a long way over the last 6 years. It has a simple API covering a useful subset of pandas features.

dask.dataframe takes care of chunking internally, supports many parallelisable operations and allows you to export slices back to pandas easily for in-memory operations.

import dask.dataframe as dd

df = dd.read_csv('filename.csv')
df.head(10)  # return first 10 rows
df.tail(10)  # return last 10 rows

# iterate rows
for idx, row in df.iterrows():
    ...

# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()

# slice by column
df[df.my_field=='XYZ'].compute()
jpp
2

Here's the code for loading text files of any size without causing memory issues. It supports gigabyte-sized files.

https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d

Download the file data_loading_utils.py and import it into your code.

Usage:

import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000


def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data, data is one single line of the file
        pass
    else:
        # end of file reached
        pass


data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)

The process_lines method is the callback function. It will be called for every line, with the parameter data representing one single line of the file at a time.

You can configure the variable CHUNK_SIZE depending on your machine's hardware configuration.

Iyvin Jose
  • While I like this method, you run the risk of having a line in your text broken into chunks. I saw this personally, which means that if you are searching for a string in the file like I was, I'd miss some because the lines they were on were broken into chunks. Is there a way to get around this? Using readlines didn't work well as I got miscounts – edo101 Jun 03 '20 at 02:38
0

How about this? Divide your file into chunks and read each chunk line by line. When you read a file, your operating system caches the following data anyway, so reading strictly line by line does not make efficient use of that cached information.

Instead, divide the file into chunks, load a whole chunk into memory, and then do your processing.

import os

def chunks(fh, size=1024):
    # Yield (offset, length) pairs for chunks that each end on a newline,
    # without storing the whole file in memory
    while True:
        startat = fh.tell()  # file object's current position from the start
        print startat
        fh.seek(size, 1)  # offset from current position --> 1
        data = fh.readline()  # read on to the end of the current line
        yield startat, fh.tell() - startat
        if not data:
            break

if os.path.isfile(fname):  # fname is the path to your file
    try:
        fh = open(fname, 'rb')
    except IOError as e:  # file --> permission denied
        print "I/O error({0}): {1}".format(e.errno, e.strerror)
    except Exception as e1:  # handle other exceptions such as attribute errors
        print "Unexpected error: {0}".format(e1)
    for ele in chunks(fh):
        fh.seek(ele[0])  # startat
        data = fh.read(ele[1])  # endat
        print data
  • This looks promising. Is this loading by bytes or by lines? I'm afraid of lines being broken if it's by bytes.. how can we load say 1000 lines at a time and process that? – Nikhil VJ Mar 31 '18 at 03:59
0

Thank you! I have recently converted to Python 3 and have been frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b'", which I guess meant it was in binary format. Using decode('utf-8') changed it to ASCII.

Then I had to remove a "=\n" in the middle of each line.

Then I split the lines at the new line.

import binascii

b_data = fh.read(ele[1])  # endat -- this is one chunk of ascii data in binary format
a_data = binascii.b2a_qp(b_data).decode('utf-8')  # data chunk in 'split' ascii format
data_chunk = a_data.replace('=\n', '').strip()  # splitting characters removed
data_list = data_chunk.split('\n')  # list containing lines in chunk
# print(data_list, '\n')
# time.sleep(1)
for j in range(len(data_list)):  # iterate through data_list to get each item
    i += 1
    line_of_data = data_list[j]
    print(line_of_data)

Here is the code starting just above "print data" in Arohi's code.

WhatsThePoint
0

The best solution I found regarding this; I tried it on a 330 MB file.

lineno = 500
line_length = 8
with open('catfour.txt', 'r') as file:
    file.seek(lineno * (line_length + 2))
    print(file.readline(), end='')

Where line_length is the number of characters in a single line. For example, "abcd" has line length 4.

I have added 2 to the line length to skip the line-ending character(s) and move to the next line.

Ali Sajjad
0

I realise this has been answered quite some time ago, but here is a way of doing it in parallel without killing your memory overhead (which would be the case if you tried to fire each line into the pool). Obviously swap the readJSON_line2 function out for something sensible - it's just to illustrate the point here!

Speedup will depend on file size and what you are doing with each line - but worst case scenario for a small file and just reading it with the JSON reader, I'm seeing similar performance to the single-threaded (ST) version with the settings below.

Hopefully useful to someone out there:

def readJSON_line2(linesIn):
    # Function for reading a chunk of json lines
    '''
    Note, this function is nonsensical. A user would never use the approach suggested
    for reading in a JSON file,
    its role is to evaluate the MT approach for full line by line processing to both
    increase speed and reduce memory overhead
    '''
    import json

    linesRtn = []
    for lineIn in linesIn:

        if lineIn.strip() != "":
            lineRtn = json.loads(lineIn)
        else:
            lineRtn = ""

        linesRtn.append(lineRtn)

    return linesRtn


# -------------------------------------------------------------------
if __name__ == "__main__":
    import multiprocessing as mp

    path1 = "C:\\user\\Documents\\"
    file1 = "someBigJson.json"

    nCPUs = mp.cpu_count()
    pool = mp.Pool(nCPUs)  # Worker pool the chunks are submitted to

    nBuffer = 20*nCPUs  # How many chunks are queued up (so cpus aren't waiting on processes spawning)
    nChunk = 1000  # How many lines are in each chunk
    # Both of the above will require balancing speed against memory overhead

    iJob = 0   # Tracker for SMP jobs submitted into pool
    iiJob = 0  # Tracker for SMP jobs extracted back out of pool

    jobs = []    # SMP job holder
    MTres3 = []  # Final result holder
    chunk = []
    iBuffer = 0  # Buffer line count
    with open(path1+file1) as f:
        for line in f:

            # Send to the chunk
            if len(chunk) < nChunk:
                chunk.append(line)
            else:
                # Chunk full
                # Don't forget to add the current line to chunk
                chunk.append(line)

                # Then add the chunk to the buffer (submit to SMP pool)
                jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
                iJob += 1
                iBuffer += 1
                # Clear the chunk for the next batch of entries
                chunk = []

            # Buffer is full, any more chunks submitted would cause undue memory overhead
            # (Partially) empty the buffer
            if iBuffer >= nBuffer:
                temp1 = jobs[iiJob].get()
                for rtnLine1 in temp1:
                    MTres3.append(rtnLine1)
                iBuffer -= 1
                iiJob += 1

        # Submit the last chunk if it exists (as it would not have been submitted to SMP buffer)
        if chunk:
            jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
            iJob += 1
            iBuffer += 1

        # And gather up the last of the buffer, including the final chunk
        while iiJob < iJob:
            temp1 = jobs[iiJob].get()
            for rtnLine1 in temp1:
                MTres3.append(rtnLine1)
            iiJob += 1

    # Cleanup
    del chunk, jobs, temp1
    pool.close()
Amiga500
-1

This might be useful when you want to work in parallel and read the data in chunks while keeping the chunk boundaries on clean newlines.

def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        # extend the chunk up to the next newline so no line is split across chunks
        while data[-1:] != '\n':
            extra = fileObj.read(1)
            if not extra:  # end of file without a trailing newline
                break
            data += extra
        yield data
Ahmad
Adam