I have a file with hundreds of thousands of records, one per line. I need to read 100, process them, read another 100, process them, and so forth. I don't want to load all of those records into memory at once. How do I read (until EOF) 100 or fewer lines (when EOF is encountered) at a time from an open file in Python?

Dervin Thunk
  • please define "record" – notorious.no Apr 08 '15 at 19:45
  • Call `readline()` 100 times... stop calling it if you hit EOF...? – John Kugelman Apr 08 '15 at 19:46
  • Is there a specific reason you need to process them 100 at a time, rather than one at a time, or 64 at a time, or whatever? Is this to do with buffering, or is there something in particular about 100? – DNA Apr 08 '15 at 19:49
  • Related to: http://stackoverflow.com/questions/24716001/python-reading-in-a-text-file-in-a-set-line-range . The accepted answer seems to fit your needs, with a little tweaking so you can read the first 100 lines, then the next 100, etc. I don't know the memory impact, though. –  Apr 08 '15 at 19:51
  • @DNA: I need to process 100 at a time because I'm using an api with a cap on the number of calls I can make. I can obviously parameterize the value for other APIs. – Dervin Thunk Apr 08 '15 at 19:52
  • Take this! http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python – Paulo Abreu Apr 08 '15 at 19:56

5 Answers

7

islice() can be used to retrieve the next n items of an iterator.

from itertools import islice

with open(...) as file:
    while True:
        lines = list(islice(file, 100))  # read at most the next 100 lines
        for line in lines:
            pass  # do stuff with each line
        if not lines:  # an empty list means EOF was reached
            break
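
A more compact variant along the same lines (a sketch; 'data.txt' is a placeholder filename) uses the two-argument form of the built-in iter(), which keeps calling the lambda until it returns the sentinel []:

from itertools import islice

with open('data.txt') as f:
    for lines in iter(lambda: list(islice(f, 100)), []):
        # lines is a list of up to 100 lines; iteration stops at the [] sentinel
        pass  # do stuff with the batch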
ILostMySpoon
  • It works well. I would suggest putting the `if not lines` check before `do stuff` to avoid a blank list at the end of the while loop. – Libin Wen Jun 11 '17 at 05:24
2
with open('file.txt', 'r') as f:
    workset = [] # start a work set
    for line in f: # iterate over file
        workset.append(line) # add current line to work set
        if len(workset) == 100: # if 100 items in work set,
            dostuff(workset) # send work set to processing
            workset = [] # make a new work set
    if workset: # if there's an unprocessed work set at the end (<100 items),
        dostuff(workset) # process it
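
This never holds more than 100 lines in memory at once, and the trailing if workset: check takes care of a final partial batch of fewer than 100 lines; dostuff stands in for whatever processing you do on each batch.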
TigerhawkT3
2

A runnable example using the take recipe from the itertools page:

from itertools import islice

# Recipe from https://docs.python.org/2/library/itertools.html
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

if __name__ == "__main__":
    with open('data.txt', 'r') as f:
        while True:
            lines = take(100, f)
            if lines:
                print(lines)
            else:
                break
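
Since the motivation (per the question comments) is an API with a cap on calls, a sketch that throttles between batches of 100 (call_api and the one-second pause are hypothetical stand-ins; adjust to the real API's limit):

import time
from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

with open('data.txt', 'r') as f:
    while True:
        lines = take(100, f)
        if not lines:
            break
        call_api(lines)  # hypothetical stand-in for the rate-capped API call
        time.sleep(1)    # assumed pacing between batches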
Glorfindel
DNA
1

You could use izip_longest (renamed zip_longest in Python 3) in the grouper recipe, which would also address your EOF issue:

from itertools import izip_longest  # renamed zip_longest in Python 3

with open("my_big_file") as f:
    for chunk_100 in izip_longest(*[f] * 100):
        pass  # record my lines; the final chunk is padded with None at EOF

Here we simply iterate over the file's lines, specifying a fixed chunk length of 100 lines; at EOF the final chunk is padded out to 100 items with the fill value.

A simple example of the grouper recipe (from the docs):

from itertools import izip_longest  # zip_longest in Python 3

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
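
In Python 3 the function is named zip_longest; a sketch of the recipe in use that also strips the None padding from the final chunk ("my_big_file" as above):

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

with open("my_big_file") as f:
    for chunk in grouper(f, 100):
        lines = [line for line in chunk if line is not None]  # drop EOF padding
        # process up to 100 lines here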
miradulo
1
file.readlines(sizehint)

Instead of creating your own iterator, you can use the built-in one.

Python's file.readlines() method returns a list of all the lines in the file; if the file is too big, that list won't fit in memory.

So you can use the sizehint parameter. It reads sizehint bytes (not lines) from the file, plus enough more to complete the last line, and returns the lines from that.

Only complete lines will be returned.

For example:

f.readlines(1000)

reads about 1000 bytes' worth of complete lines from the file (pass the hint positionally; the built-in file object may not accept it as a keyword, and in Python 3 the parameter is named hint).
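
Putting it in a loop gives the read-process-repeat pattern from the question (a sketch; the 100,000-byte hint and 'data.txt' filename are assumptions, since the hint counts bytes rather than the 100 lines asked for):

with open('data.txt') as f:
    while True:
        lines = f.readlines(100000)  # roughly 100 kB of complete lines per batch
        if not lines:  # empty list means EOF
            break
        for line in lines:
            pass  # process each line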

ggcarmi