1
fv13303118  2   918384  FR
fv6665000   2   924898  AS
fv2341362   2   927309  AF
fv9777703   2   928836  TC
fv1891910   2   932457  SG
fv9697457   2   934345  GG
fv35940137  2   940203  GG
fv3128117   2   944564  TT
fv2465126   2   947034  AG

I have a text file of more than 50 GB in this format. I need to process it, but I only need to read the "fvxxxxx" field of each line.

lines = f.readlines()
for x in lines:
    blabla()

I think it's definitely not the fastest way.

Edit-

Actually, there are more than 2000 files, each about 20 MB. I want to read just the first 11 characters of each line and then skip to the next line. My memory limit is 4 GB.

Tuğberk
  • Is the file delimited by spaces? – Jesse Jul 01 '18 at 02:05
  • Is each line guaranteed to be a specific length? Is the first field in each line guaranteed to be a specific length? How much memory can you afford to use while reading the file? If there are specs for the field formats, you should include them in the question. – wwii Jul 01 '18 at 02:08

3 Answers

4

readlines() reads the entire input stream into a list at once, which is very inefficient here because your input is far larger than your available memory.

You should use the file object as an iterator so that it reads one line at a time in a memory-efficient way:

for x in f:
    blabla()
blhsing
  • Could I just read the first 11 characters and skip to the next line? This method doesn't cache the whole file, but it's not as fast as I want because it still reads every line in full. – Tuğberk Jul 01 '18 at 02:08
  • No, because to get to the next line you have to read all the characters in the current line in order to find where the next newline character is. – blhsing Jul 01 '18 at 02:11
  • @Tuğberk are the lines a fixed width? If so, you could use `f.seek()` – jedwards Jul 01 '18 at 02:11
  • @Tuğberk just do it like `x.split(" ")[0]` to get the first word - split by space – dmitryro Jul 01 '18 at 02:13
  • 1
    @jedwards has a good point. If all the lines are the same length, then you can use `f.read(length_of_first_column)` and then `f.seek(line_number * length_of_each_line)` to jump to the next line. – blhsing Jul 01 '18 at 02:13
  • @jedwards The line length is different in every file, but I can calculate it from the first line of each file and use f.seek(). Thanks. – Tuğberk Jul 01 '18 at 02:18
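The fixed-width idea from the comments above could be sketched roughly as follows. This is only an illustration under a strong assumption: every line within a given file has exactly the same length in bytes, so the length of the first line can be used as the record length. The function name `first_fields_fixed_width` and the `field_len` parameter are made up for this example; 11 is the field width mentioned in the question.

```python
import os

def first_fields_fixed_width(path, field_len=11):
    """Yield the first `field_len` bytes of each line.

    Assumes every line in the file has the same byte length.
    """
    with open(path, "rb") as f:
        # Measure the record length (newline included) from the first line.
        record_len = len(f.readline())
        if record_len == 0:
            return  # empty file
        size = os.path.getsize(path)
        offset = 0
        while offset < size:
            f.seek(offset)  # jump straight to the start of the record
            yield f.read(field_len).decode("ascii").strip()
            offset += record_len
```

With lines this short, the operating system and Python's buffering will probably still read most of the file from disk, so this is not guaranteed to beat plain iteration; it is worth timing both on a real file.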
1

The standard open() function should by default return a buffered file.

Something like:

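# 'file.txt' stands in for the path to your input file
with open('file.txt') as file_obj:
    for line in file_obj:
        # first whitespace-separated field of the line
        x = line.strip().split()[0]
        print(x)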

Edited: to meet your requirement of only printing first part of your line.

Jesse
1

You can open() the file with a context manager, loop over the file object, split each line on whitespace, and take the first element:

with open('file.txt') as in_file:
    for line in in_file:
        fx, *rest = line.strip().split()
        print(fx)

Which will give you:

fv13303118
fv6665000
fv2341362
fv9777703
fv1891910
fv9697457
fv35940137
fv3128117
fv2465126

The benefit of the above approach is that it uses the file object as an iterator, which avoids copying the whole file into memory at once with readlines().
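Since the question's edit mentions more than 2000 separate files, the same iterator approach could be wrapped in a generator that walks over all of them. This is a minimal sketch, not a tuned solution: the name `first_fields` and the glob pattern in the usage line are hypothetical, and it assumes the first field is separated from the rest of the line by whitespace. Passing a maximum split count of 1 avoids splitting the remaining columns, and blank lines are skipped.

```python
import glob

def first_fields(pattern):
    """Yield the first whitespace-separated field of every non-blank line
    in every file matching `pattern`, reading one line at a time."""
    for path in sorted(glob.glob(pattern)):
        with open(path) as in_file:
            for line in in_file:
                # Split only once; the rest of the line is left unsplit.
                parts = line.split(None, 1)
                if parts:  # skip blank lines
                    yield parts[0]
```

It could be used as `for fv in first_fields('data/*.txt'): ...`, where `data/*.txt` is a placeholder pattern. Only one line is held in memory at a time, so the 4 GB limit should not be a concern.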

RoadRunner