How can I read lines faster?

Question

fv13303118  2   918384  FR
fv6665000   2   924898  AS
fv2341362   2   927309  AF
fv9777703   2   928836  TC
fv1891910   2   932457  SG
fv9697457   2   934345  GG
fv35940137  2   940203  GG
fv3128117   2   944564  TT
fv2465126   2   947034  AG

I have a text file of more than 50 GB like this. I need to process it, but I only need to read the "fvxxxxx" field of each line.

lines = f.readlines()
for x in lines:
    blabla()

I think it's definitely not the fastest way.

Edit:

Actually, there are more than 2000 files. Each file is 20 MB. I want to read just the first 11 characters of each line and skip to the next line. My memory limit is 4 GB.

Is each line guaranteed to be a specific length? Is the first field in each line guaranteed to be a specific length? How much memory can you afford to use while reading the file? If there are specs for the field formats, you should include them in the question. – wwii, Jul 01 '18 at 02:08

score 4 · Answer 1 · answered Jul 01 '18 at 02:05

readlines() reads the entire input stream into a list, which is hugely inefficient here because your input is far larger than your available memory.

You should use the file object as an iterator so that it reads one line at a time in a memory-efficient way:

for x in f:
    blabla()
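As a minimal sketch of that idea applied to the question, the generator below iterates over the file lazily and keeps only the first 11 characters of each line. The name `iter_prefixes` and the `width` parameter are made up for illustration, and it assumes the first field never exceeds 11 characters, as in the sample data.

```python
def iter_prefixes(path, width=11):
    """Yield the first `width` characters of each line, without trailing whitespace.

    Hypothetical helper: the name and the default width are assumptions
    based on the "first 11 letters" mentioned in the question.
    """
    with open(path) as f:
        # Iterating over the file object reads one line at a time,
        # so only the current line is held in memory.
        for line in f:
            yield line[:width].rstrip()
```

Each line is still read in full from disk; the slice only avoids splitting the rest of the line.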

blhsing

Could I just read the first 11 characters and skip to the next line? This method doesn't load the whole file into memory, but it's not as fast as I want because it still reads every line eventually. – Tuğberk Jul 01 '18 at 02:08
No, because to get to the next line you have to read all the characters in the current line in order to find where the next newline character is. – blhsing Jul 01 '18 at 02:11
@Tuğberk are the lines a fixed width? If so, you could use `f.seek()` – jedwards Jul 01 '18 at 02:11
@Tuğberk just do it like `x.split(" ")[0]` to get the first word - split by space – dmitryro Jul 01 '18 at 02:13
@jedwards has a good point. If all the lines are of the same length then you can use `f.read(length_of_first_column)` and then `f.seek(line_number * length_of_each_line)` to jump to the next line. – blhsing Jul 01 '18 at 02:13
@jedwards The line length is different for every file, but I can calculate it from the first line and use f.seek(). Thanks. – Tuğberk Jul 01 '18 at 02:18
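The read-and-seek idea from these comments could look roughly like the sketch below. It assumes every line in a given file has exactly the same byte length (newline included), which the question does not guarantee; the function name `first_fields_fixed_width` and the `field_len` parameter are hypothetical. The file is opened in binary mode so that offsets are plain byte counts.

```python
import os


def first_fields_fixed_width(path, field_len=11):
    """Yield the first field of every line in a fixed-width file.

    Assumption: all lines in the file have the same byte length as the
    first line. If that does not hold, the results will be wrong.
    """
    with open(path, "rb") as f:
        # Measure the line length (including the newline) from the first line.
        line_len = len(f.readline())
        if line_len == 0:
            return  # empty file
        size = os.fstat(f.fileno()).st_size
        for offset in range(0, size, line_len):
            f.seek(offset)                 # jump to the start of the line
            field = f.read(field_len)      # read only the first column
            yield field.decode("ascii").rstrip()
```

Whether this is actually faster than plain line iteration depends on buffering and the storage device, so it is worth timing both on a real file.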

Jesse · Answer 2 · 2018-07-01T02:16:22.180

1

The standard open() function should by default return a buffered file.

Something like:

with open('file.txt') as file_obj:  # replace 'file.txt' with your file path
    for line in file_obj:
        # keep only the first whitespace-separated field
        x = line.strip().split()[0]
        print(x)

Edited: to meet your requirement of only printing first part of your line.

edited Jul 01 '18 at 02:16

answered Jul 01 '18 at 02:08

RoadRunner · Answer 3 · 2018-07-01T02:40:10.937

You can open() the file with a context manager, loop over the file object, split each line on whitespace, and take the first element:

with open('file.txt') as in_file:
    for line in in_file:
        fx, *rest = line.strip().split()
        print(fx)

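For the sample data in the question, this prints:

fv13303118
fv6665000
fv2341362
fv9777703
fv1891910
fv9697457
fv35940137
fv3128117
fv2465126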

The benefit of the above approach is that it uses the file object as an iterator, which avoids copying the whole file into memory at once with readlines().
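Since the question's edit mentions more than 2000 separate files, the same approach can be sketched as a generator over many paths. The name `first_column` and the `data/*.txt` pattern are placeholders, not something from the question. Using `split(maxsplit=1)` stops after the first field and returns an empty list for blank lines, which the unpacking version above would reject with a ValueError.

```python
import glob


def first_column(paths):
    """Yield the first whitespace-separated field of every non-blank line.

    Hypothetical helper: processes one file at a time, one line at a time.
    """
    for path in paths:
        with open(path) as in_file:
            for line in in_file:
                # Split off only the first field; the rest stays unsplit.
                fields = line.split(maxsplit=1)
                if fields:  # skip blank lines
                    yield fields[0]


if __name__ == "__main__":
    # "data/*.txt" is an assumed location; adjust to where the files live.
    for fx in first_column(sorted(glob.glob("data/*.txt"))):
        print(fx)
```

Only one line is held in memory at any time, so the 4 GB limit is not a concern regardless of the number of files.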
