3

I have a list of 4 million words in a txt file that I want to add to a list. I have two options:

l=[line for line in open(wordlist)]

or:

wordlist = file.readlines()

readlines() appears to be much faster; I'm guessing this is because the data is read into memory in one go. The first option would be better for conserving memory because it reads one line at a time. Is this true? Does readlines() use any kind of buffer when copying? In general, which is best to use?

KexAri
  • You may read one line at a time with readline(), or all lines with readlines() if you have enough memory to hold the resulting list in either case. What will you do with the list of words afterwards? That will tell you what you should do. Holding a list of 4 million words doesn't make sense if you don't need access to all of them at any moment. – Richard Aug 28 '15 at 17:34
  • possible duplicate of [In Python, is read() , or readlines() faster?](http://stackoverflow.com/questions/5076024/in-python-is-read-or-readlines-faster) – sista_melody Aug 28 '15 at 17:41

3 Answers

8

Both options read the whole thing into memory in one big list. The first option is slower because the looping is done in Python bytecode by the list comprehension, while readlines() loops in C. If all you want is one big list with every line from your file, there is no reason to use a list comprehension here.
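To make that first point concrete, here is a quick sketch of my own (not from the original answer) showing that both spellings end up with the same list; `wordlist` is the path variable from the question:

with open(wordlist) as f:
    via_comprehension = [line for line in f]
with open(wordlist) as f:
    via_readlines = f.readlines()
# Same data either way: a list of raw lines, trailing newlines included.
assert via_comprehension == via_readlines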

I'd not use either. Loop over the file and process the lines as you loop:

with open(wordlist) as fileobj:
    for line in fileobj:
        # do something with this line only.

There is usually no need to keep the whole unprocessed file data in memory.
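As a concrete illustration (my own sketch, not part of the original answer), "do something" could mean keeping only the words you actually care about; the length filter here is just a placeholder:

# Stream the file and keep only what you need (here: words longer than 10 characters).
long_words = []
with open(wordlist) as fileobj:
    for line in fileobj:
        word = line.strip()
        if len(word) > 10:
            long_words.append(word)
print(len(long_words))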

Martijn Pieters
  • But does the first one use less memory, since it transfers the data line by line? I never considered looping over the file before. I am doing a binary search on the list though, so I need to sort it first. – KexAri Aug 28 '15 at 17:44
  • @KexAri: the list comprehension uses about the same amount of memory as the `readlines()` call; in the end you have the exact same object. – Martijn Pieters Aug 28 '15 at 17:46
1

I think the real answer is, it depends.

If you have the memory and it doesn't matter how much you use, then by all means put all 4 million strings into a list with the readlines() method. But then I would ask: is it really necessary to keep them all in memory at once?

Probably the more performant method would be to iterate over each line/word one at a time, do something with that word (count it, hash-vectorize it, etc.), and then let the garbage collector take it to the dump. This approach uses a generator which pops off one line at a time, versus reading everything into memory unnecessarily.

A lot of the built-ins in Python 3.x have moved to this lazy, generator-like style; one example is range, which now behaves like Python 2's xrange.
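A rough sketch of that style (my own example, assuming one word per line in a hypothetical words.txt):

from collections import Counter

def words(path):
    # Generator: yields one stripped word at a time, never holding the whole file.
    with open(path) as f:
        for line in f:
            yield line.strip()

# Consume lazily, e.g. count word lengths without building a 4-million-item list.
length_counts = Counter(len(w) for w in words("words.txt"))
print(length_counts.most_common(5))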

dalanmiller
0

Considering you are "doing a binary search on the list though so need to sort it first", you need to read the data into a list and sort it. On a file with 10 million random digits, calling readlines() and an in-place .sort() is slightly faster:

In [15]: %%timeit
with open("test.txt") as f:
     r = f.readlines()
     r.sort()
   ....: 
1 loops, best of 3: 719 ms per loop

In [16]: %%timeit
with open("test.txt") as f:
    sorted(f)
   ....: 
1 loops, best of 3: 776 ms per loop

In [17]: %%timeit
with open("test.txt") as f:
     r = [line for line in f] 
     r.sort()
   ....: 
1 loops, best of 3: 735 ms per loop

You have the same data in the list whichever approach you use, so there is no memory advantage; the only difference is that readlines() is a bit more efficient than a list comprehension or calling sorted() on the file object.
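If the end goal really is a binary search, the sorted list from readlines() can then be searched with the standard bisect module. A minimal sketch of my own (the file name and target word are placeholders; note that readlines() keeps the trailing newlines, so either strip them or include "\n" in the target):

import bisect

with open("test.txt") as f:
    r = f.readlines()
    r.sort()

target = "zebra\n"  # placeholder word, newline included to match the raw lines
i = bisect.bisect_left(r, target)
print(i < len(r) and r[i] == target)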

Padraic Cunningham