46

I need to parse thousands of text files (around 3000 lines each, ~400 KB per file) in a folder. I read them using readlines:

for filename in os.listdir(input_dir):
    if filename.endswith(".gz"):
        f = gzip.open(filename, 'rb')
    else:
        f = open(filename, 'rb')

    file_content = f.readlines()
    f.close()

    len_file = len(file_content)
    i = 0
    while i < len_file:
        line = file_content[i].split(delimiter)
        ... my logic ...
        i += 1

This works completely fine for a small sample of my inputs (50-100 files). When I ran it on the whole input of more than 5K files, the time taken was nowhere close to a linear increase, so I did a performance analysis with cProfile. The time taken grows much faster than linearly as the number of files increases, and gets dramatically worse by the time the input reaches 7K files.
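
A profile like the one below can be generated like this (parse_files.py and main() are placeholder names for my script and its top-level function):

python -m cProfile -s cumulative parse_files.py

or, from inside the script:

import cProfile
# 'main' is a placeholder for the function containing the file loop above
cProfile.run('main()', sort='cumulative')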

Here is the cumulative time taken by readlines: the first row is for 354 files (a sample of the input), the second for 7473 files (the whole input):

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 354    0.192    0.001    **0.192**    0.001 {method 'readlines' of 'file' objects}
 7473 1329.380    0.178  **1329.380**    0.178 {method 'readlines' of 'file' objects}

Because of this, the time taken by my code does not scale linearly as the input increases. I read some doc notes on readlines(), where people have claimed that readlines() reads the whole file content into memory and hence generally consumes more memory compared to readline() or read().

I agree with this point, but shouldn't the garbage collector automatically clear the loaded content from memory at the end of each loop iteration, so that at any instant memory holds only the contents of the currently processed file? But there seems to be some catch here. Can somebody give some insight into this issue?

Is this an inherent behavior of readlines(), or is my interpretation of the Python garbage collector wrong? Glad to know.

Also, please suggest some alternative ways of doing the same thing in a memory- and time-efficient manner. TIA.

Maximilian Peters
Learner
  • 4
    As a side note, there is never a good reason to write `len_file = len(file_content)`, then a `while( i < len_file ):` loop with `i += 1` and `file_content[i]` inside. Just use `for line in file_content:`. If you also need `i` for something else, use `for i, line in enumerate(file_content)`. You're making things harder for yourself and your readers (and for the interpreter, which means your code may run slower, but that's usually much less important here). – abarnert Jun 22 '13 at 01:17
  • Thanks @abarnert. I'll change them. – Learner Jun 22 '13 at 01:28
  • 4
    One last style note: In Python, you can just write `if filename.endswith(".gz"):`; you don't need parentheses around the condition, and shouldn't use them. One of the great things about Python is how easy it is both to skim quickly and to read in-depth, but putting in those parentheses makes it much harder to skim (because you have to figure out whether there's a multi-line expression, a tuple, a genexp, or just code written by a C/Java/JavaScript programmer). – abarnert Jun 24 '13 at 18:10
  • Nice tip, duly noted. Will change them as well. – Learner Jun 24 '13 at 23:35

2 Answers

92

The short version is: The efficient way to use readlines() is to not use it. Ever.


I read some doc notes on readlines(), where people have claimed that readlines() reads the whole file content into memory and hence generally consumes more memory compared to readline() or read().

The documentation for readlines() explicitly guarantees that it reads the whole file into memory, and parses it into lines, and builds a list full of strings out of those lines.

But the documentation for read() likewise guarantees that it reads the whole file into memory, and builds a string, so that doesn't help.


On top of using more memory, this also means you can't do any work until the whole thing is read. If you alternate reading and processing in even the most naive way, you will benefit from at least some pipelining (thanks to the OS disk cache, DMA, CPU pipeline, etc.), so you will be working on one batch while the next batch is being read. But if you force the computer to read the whole file in, then parse the whole file, then run your code, you only get one region of overlapping work for the entire file, instead of one region of overlapping work per read.


You can work around this in three ways:

  1. Write a loop around readlines(sizehint), read(size), or readline().
  2. Just use the file as a lazy iterator without calling any of these.
  3. mmap the file, which allows you to treat it as a giant string without first reading it in.

For example, this has to read all of foo at once:

with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass

But this only reads about 8K at a time:

with open('foo') as f:
    while True:
        lines = f.readlines(8192)
        if not lines:
            break
        for line in lines:
            pass
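
The read(size) variant from option 1 is a bit more work, because you have to split the lines yourself; a minimal sketch (shown only for completeness) looks like this:

with open('foo', 'rb') as f:
    leftover = b''
    while True:
        chunk = f.read(8192)
        if not chunk:
            if leftover:
                pass  # process the final line, which had no trailing newline
            break
        lines = (leftover + chunk).split(b'\n')
        leftover = lines.pop()      # the last piece may be an incomplete line
        for line in lines:
            pass                    # process one complete line (newline stripped)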

And this only reads one line at a time—although Python is allowed to (and will) pick a nice buffer size to make things faster.

with open('foo') as f:
    while True:
        line = f.readline()
        if not line:
            break
        pass

And this will do the exact same thing as the previous:

with open('foo') as f:
    for line in f:
        pass
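
Option 3, mmap, isn't shown above; here's a minimal sketch (assuming the file fits comfortably in your address space). The mapped file behaves like a giant string, but mmap objects also have a readline(), so you still never build the whole list of lines:

import mmap

with open('foo', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # readline() returns an empty string at EOF (b'' on Python 3)
        for line in iter(mm.readline, b''):
            pass
    finally:
        mm.close()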

Meanwhile:

but shouldn't the garbage collector automatically clear the loaded content from memory at the end of each loop iteration, so that at any instant memory holds only the contents of the currently processed file?

Python doesn't make any such guarantees about garbage collection.

The CPython implementation happens to use refcounting for GC, which means that in your code, as soon as file_content gets rebound or goes away, the giant list of strings, and all of the strings within it, will be freed to the freelist, meaning the same memory can be reused again for your next pass.
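
You can watch that happen in CPython with a weak reference (a toy illustration, unrelated to the file-reading code):

import weakref

class Blob(object):
    pass

blob = Blob()
ref = weakref.ref(blob)
print(ref() is blob)    # True: the object is alive while 'blob' refers to it

blob = None             # rebind the only reference...
print(ref())            # ...and CPython reclaims the object immediately: prints None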

However, all those allocations, copies, and deallocations aren't free—it's much faster to not do them than to do them.

On top of that, having your strings scattered across a large swath of memory instead of reusing the same small chunk of memory over and over hurts your cache behavior.

Plus, while the memory usage may be constant (or, rather, linear in the size of your largest file, rather than in the sum of your file sizes), that rush of mallocs to expand it the first time will be one of the slowest things you do (which also makes it much harder to do performance comparisons).
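
If you want to see the difference on your own data, a rough comparison might look like this ('sample.txt' stands in for one of your ~400KB files; the absolute numbers will depend on your machine and on how warm the OS cache is):

import timeit

def with_readlines(path):
    with open(path, 'rb') as f:
        for line in f.readlines():   # builds the whole list first
            pass

def with_iteration(path):
    with open(path, 'rb') as f:
        for line in f:               # reads buffered chunks lazily
            pass

print(timeit.timeit(lambda: with_readlines('sample.txt'), number=100))
print(timeit.timeit(lambda: with_iteration('sample.txt'), number=100))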


Putting it all together, here's how I'd write your program:

for filename in os.listdir(input_dir):
    with open(filename, 'rb') as f:
        if filename.endswith(".gz"):
            f = gzip.GzipFile(fileobj=f)
        words = (line.split(delimiter) for line in f)
        ... my logic ...  

Or, maybe:

for filename in os.listdir(input_dir):
    if filename.endswith(".gz"):
        f = gzip.open(filename, 'rb')
    else:
        f = open(filename, 'rb')
    with contextlib.closing(f):
        words = (line.split(delimiter) for line in f)
        ... my logic ...
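
And if the lines need to be decoded to unicode rather than handled as raw bytes, one simple variation (a sketch assuming UTF-8 input; substitute whatever encoding your files actually use) is to decode each line as it is read:

for filename in os.listdir(input_dir):
    if filename.endswith(".gz"):
        f = gzip.open(filename, 'rb')
    else:
        f = open(filename, 'rb')
    with contextlib.closing(f):
        # decode each raw line before splitting it
        words = (line.decode('utf-8').split(delimiter) for line in f)
        ... my logic ...
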
Boris
abarnert
  • I should have told this earlier: my input directory might contain gzip files and also normal text files, so for the file open I'm using an if-else construct. I'm afraid this 'with' might not work out. – Learner Jun 22 '13 at 01:10
  • @Learner: Sure it will: `with open('foo', 'rb') as f:`, then you can create a `GzipFile(fileobj=f)` if necessary (or an `io.TextIOWrapper` if it's a text file you want decoded to `unicode`, or a `csv.reader` if it's a CSV file you want decoded to rows, etc.). At any rate, the `with` part isn't relevant here; all of the options are exactly the same options with explicit `close`, except more verbose and less robust. – abarnert Jun 22 '13 at 01:13
  • I'm not sure if I understood the io.TextIOWrapper part. Any links to follow? TIA :) – Learner Jun 22 '13 at 01:34
  • 1
    @Learner: I'm assuming you're using Python 2, yes? If so, the reference docs are [here](http://docs.python.org/2/library/io.html#io.TextIOWrapper), and the way to learn is… read the differences between Python 2 text files and Python 3 text files (maybe start [here](http://docs.python.org/3.2/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit)); `io.TextIOWrapper` turns the former into the latter, so you can write clean Py3-style code that only deals with `unicode` objects, not encoded bytes, even in Py2. – abarnert Jun 22 '13 at 01:43
  • Thanks @abarnert, I used the last method you quoted with contextlib.closing() - worked great. The time taken was reduced, and the program scaled linearly in time even as the inputs grew :) – Learner Jun 24 '13 at 05:43
  • 1
    @Learner: Glad it helped. `closing` isn't useful that often—most of the time, you've just got a `file` or something else that can be used directly in a `with` statement—but it is handy to know for cases like this. Anyway, the important part (the part that sped up your code) is using the `file` (or `GzipFile`) directly as an iterable, instead of `readline()`-ing the whole thing into memory to use the `list` as an iterable, as Óscar López explained before me. – abarnert Jun 24 '13 at 18:06
18

Read line by line, not the whole file:

for line in open(file_name, 'rb'):
    # process line here

Even better, use with so the file is closed automatically:

with open(file_name, 'rb') as f:
    for line in f:
        # process line here

The above will read the file object using an iterator, one line at a time.

Óscar López
  • I understand what you mean by this. I would like to know the root cause as well. What is the reason behind readlines() being slow? Thanks! – Learner Jun 22 '13 at 00:54
  • 1
    That `readlines` will read _the whole file_ at once into a list, which can be a problem if it's big - it'll use a lot of memory! – Óscar López Jun 22 '13 at 00:55
  • Yeah, but given that my file is 400KB (<0.5MB) and will be discarded from memory at the end of every iteration, readlines() reading the whole file shouldn't be a problem, right? – Learner Jun 22 '13 at 00:57
  • But anyway you'll be creating a lot of potentially big lists that get discarded immediately, but not really freed from memory until the next run of the garbage collector. In Python, the preferred style is using iterators, generator expressions, etc. - never create a new, big object when you can process little chunks of it at a time – Óscar López Jun 22 '13 at 01:00
  • @ÓscarLópez: Actually, at least in CPython, the GC generally frees up memory (not back to the OS, but to the internal free list) as soon as the name referencing it gets rebound or goes away, so that first part isn't really an issue. But your larger point is 100% right. Iterators make everything better. – abarnert Jun 22 '13 at 01:02
  • Oh! So they will still be consuming part of my program's memory and hence will slow the run as the number of files increases? Also, will this be cleared only after my program ends running? – Learner Jun 22 '13 at 01:03
  • @Learner: No, that's probably not the problem. – abarnert Jun 22 '13 at 01:04
  • 1
    Yes, you'll be consuming memory and eventually you'll start paging into disk if the physical memory runs out. And no, the GC is not deterministic, so you can't tell when the memory is going to be freed - in fact, part of the reasons for the slowdown could be the GC running – Óscar López Jun 22 '13 at 01:05
  • @ÓscarLópez: Yes, the GC _is_ deterministic in the CPython implementation, which the OP is almost certainly using (since he would have said Jython or Iron or PyPy if he were using them). – abarnert Jun 22 '13 at 01:06
  • @abarnert can you please provide a reference stating that is, in fact, deterministic? – Óscar López Jun 22 '13 at 01:07
  • 1
    @ÓscarLópez: http://docs.python.org/2/c-api/intro.html#reference-counts documents how the refcounting works. (The documentation on cycle breaking is elsewhere, but not relevant here.) The proof that it's deterministic is trivial: a pure refcounting GC is deterministic by definition (and a refcounting-plus-cycle-breaking GC is likewise deterministic when there are no cycles). – abarnert Jun 22 '13 at 01:11
  • @ÓscarLópez: Do you really not believe that CPython is refcounted, or that refcounting is deterministic, or are you just being a stickler here? – abarnert Jun 22 '13 at 01:14
  • Paging is one problem I was expecting this to run into. – Learner Jun 22 '13 at 01:14
  • @abarnert I'm just curious. The link you provided is interesting, but I'm left wondering how _frequently_ the GC runs. Sure, the ref counting algorithm is deterministic, but can we predict when it will run? That's why I'm stating that it's non-deterministic - you don't know when it'll reclaim memory – Óscar López Jun 22 '13 at 01:17
  • 3
    @ÓscarLópez: The whole point of refcounting is that _it doesn't have to run_. Every time a reference goes away (e.g., a name is rebound or goes out of scope), the count on the referenced object is decreased, and if it reaches 0, the object is reclaimed _immediately_. (The cycle detector is another, more complicated story, but again, it's not relevant here, because there are no cycles in the OP's code.) The [Wikipedia article](http://en.wikipedia.org/wiki/Reference_counting) explains it pretty well. – abarnert Jun 22 '13 at 01:20
  • 2
    @abarnert thanks for clarifying that, I learnt something new :) – Óscar López Jun 22 '13 at 01:27
  • 1
    @ÓscarLópez: Just keep in mind that this is only a feature of CPython, not all Python implementations (e.g., Jython doesn't refcount, and relies on the Java generational GC), and that even with CPython it's not always obvious when there are cycles (especially in an interactive session or the debugger), so your habits of using `with` whenever possible, etc. are definitely worth keeping. – abarnert Jun 22 '13 at 01:31