4

I have a very large text file, more than 30 GB in size. For some reasons, I want to read the lines between 1000000 and 2000000 and compare them with a user input string. If a line matches, I need to write its content to another file.

I know how to read a file line by line.

input_file = open('file.txt', 'r')
for line in input_file:
    print line

But if the size of the file is large, it will really affect performance, right? How can I address this in an optimized way?

Javad Shareef

6 Answers

9

You can use itertools.islice:

from itertools import islice

with open('file.txt') as fin:
    # Skip the first 1,000,000 lines, then yield lines up to (but not
    # including) line 2,000,000, counting from zero
    lines = islice(fin, 1000000, 2000000)  # or whatever range you need
    for line in lines:
        pass  # do something with each line

Of course, if your lines are of fixed length, you can use that to fin.seek() directly to the start of the first line you want. Otherwise, the approach above still has to read the first n lines before islice starts producing output; it is just a convenient way to limit the range.
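For example, here is a minimal sketch of the fixed-length case, assuming every line (including its newline) occupies exactly record_len bytes; record_len, the line numbers and the filename are placeholders for illustration:

record_len = 100       # hypothetical fixed line length in bytes, newline included
start_line = 1000000
num_lines = 1000000

# Binary mode so seek() offsets are exact byte positions
with open('file.txt', 'rb') as fin:
    fin.seek(start_line * record_len)   # jump straight to the first wanted line
    for _ in range(num_lines):
        line = fin.readline()
        # do something with line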

Jon Clements
2

You could use linecache.

Let me cite from the docs: "The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file.":

import linecache

# linecache line numbers are 1-based
for i in xrange(1000000, 2000000):
    print linecache.getline('file.txt', i)
miindlek
1

Do all your lines have the same size? If that is the case, you could probably use seek() to jump directly to the first line you are interested in. Otherwise, you are going to have to iterate through the file, because there is no way of telling in advance where each line starts:

with open('file.txt', 'r') as input_file:
    for index, line in enumerate(input_file):
        # Assuming you start counting from zero
        if 1000000 <= index <= 2000000:
            print line
        elif index > 2000000:
            break  # no need to read the rest of the file

For small files, the linecache module can be useful.

plok
1

If you're on Linux, have you considered using the os.system or commands Python modules to directly execute shell commands like sed, awk, head or tail to do this?

Running the command: os.system("tail -n+50000000 test.in | head -n10")

will read lines 50,000,000 to 50,000,010 from the file test.in. This post on Stack Overflow discusses different ways of calling shell commands from Python; if performance is key, there may be more efficient methods than os.system.
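For example, a minimal sketch that captures the command's output with the subprocess module instead of os.system (which only returns the exit status) might look like this; the filenames, the line range and the user_input variable are assumptions for illustration:

import subprocess

user_input = 'some string'   # hypothetical search string from the user

# tail skips to line 1,000,000, head keeps the next 1,000,000 lines
cmd = "tail -n +1000000 file.txt | head -n 1000000"
output = subprocess.check_output(cmd, shell=True, universal_newlines=True)

with open('output.txt', 'w') as fout:
    for line in output.splitlines(True):   # True keeps the line endings
        if user_input in line:
            fout.write(line)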

This discussion on unix.stackexchange discusses in-depth how to select specific ranges of a text file using the command line:

  • 100,000,000-line file generated by seq 100000000 > test.in
  • Reading lines 50,000,000-50,000,010
  • Tests in no particular order
  • real time as reported by bash's builtin time

The combination of tail and head, or using sed, seems to offer the quickest solution.

 4.373  4.418  4.395    tail -n+50000000 test.in | head -n10
 5.210  5.179  6.181    sed -n '50000000,50000010p;57890010q' test.in
 5.525  5.475  5.488    head -n50000010 test.in | tail -n10
 8.497  8.352  8.438    sed -n '50000000,50000010p' test.in 
22.826 23.154 23.195    tail -n50000001 test.in | head -n10
25.694 25.908 27.638    ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574    awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127    awk 'NR >= 57890000 && NR <= 57890010' test.in
user1953384
0

Generally, you cannot just jump to line number x in a file, because text lines have variable length, so a line can occupy anything from one byte to a gazillion bytes.

However, if you expect to seek in those files very often, you can index them, remembering in a separate file the byte offset at which, say, every thousandth line starts. Then you can open the file, use file.seek() to jump to the part of the file you are interested in, and start iterating from there.
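A minimal sketch of that idea, keeping the index as an in-memory list of byte offsets for every 1000th line rather than persisting it to a separate file; the filename, step and target line are assumptions for illustration:

STEP = 1000   # remember the byte offset of every 1000th line

# Build the index once; this pass still reads the whole file
offsets = []  # offsets[k] is the byte offset where line k*STEP starts
pos = 0
with open('file.txt', 'rb') as fin:
    for lineno, line in enumerate(fin):
        if lineno % STEP == 0:
            offsets.append(pos)
        pos += len(line)

# Later: seek close to line 1,000,000 and iterate from there
target = 1000000
with open('file.txt', 'rb') as fin:
    fin.seek(offsets[target // STEP])
    for lineno, line in enumerate(fin, (target // STEP) * STEP):
        if lineno >= 2000000:
            break
        if lineno >= target:
            pass  # do something with line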

m.wasowski
0

The best way I found is:

# Assumes the whole text is already in memory as the string multilinetext
lines_data = []
text_arr = multilinetext.split('\n')
for i in range(line_number_begin, line_number_end):
    lines_data.append(text_arr[i])
Maor Kavod