If you're on Linux, have you considered using Python's `os.system` function or the `commands` module (Python 2 only; `subprocess` replaces it in Python 3) to directly execute shell commands like `sed`, `awk`, `head`, or `tail` to do this?
Running the command `os.system("tail -n+50000000 test.in | head -n10")` prints ten lines of the file test.in, starting at line 50,000,000 (that is, lines 50,000,000 to 50,000,009).
This post on Stack Overflow discusses different ways of calling commands. If performance is key, there may be more efficient methods than `os.system`.
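As one possible alternative to `os.system`, here is a minimal sketch that runs the same `tail`/`head` pipeline through `subprocess.run` and returns the lines to Python instead of printing them. The function name `read_line_range` and its parameters are illustrative assumptions, not something from the linked posts, and the sketch assumes `tail` and `head` are available on the system.

```python
import shlex
import subprocess


def read_line_range(path, start, count):
    """Return `count` lines of `path`, beginning at 1-based line `start`.

    Hypothetical helper: it shells out to tail and head, so it only
    works where those commands are installed (e.g. Linux).
    """
    # int() and shlex.quote() keep the arguments safe to pass through a shell.
    cmd = "tail -n +{} {} | head -n {}".format(
        int(start), shlex.quote(path), int(count)
    )
    result = subprocess.run(
        cmd,
        shell=True,           # needed because the command uses a pipe
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.splitlines()
```

For example, `read_line_range("test.in", 50000000, 10)` would return the same ten lines that the `os.system` call above prints, as a list of strings.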
This discussion on Unix & Linux Stack Exchange explains in depth how to select specific ranges of a text file using the command line:
- 100,000,000-line file generated by `seq 100000000 > test.in`
- Reading lines 50,000,000-50,000,010
- Tests in no particular order
- real time as reported by bash's builtin time
Combining `tail` and `head`, or using `sed`, seems to offer the quickest solutions.
Run 1 (s)  Run 2 (s)  Run 3 (s)  Command
4.373      4.418      4.395      tail -n+50000000 test.in | head -n10
5.210      5.179      6.181      sed -n '50000000,50000010p;57890010q' test.in
5.525      5.475      5.488      head -n50000010 test.in | tail -n10
8.497      8.352      8.438      sed -n '50000000,50000010p' test.in
22.826     23.154     23.195     tail -n50000001 test.in | head -n10
25.694     25.908     27.638     ed -s test.in <<<"50000000,50000010p"
31.348     28.140     30.574     awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359     50.919     51.127     awk 'NR >= 57890000 && NR <= 57890010' test.in
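If you want to repeat this kind of comparison from Python on your own machine, a rough sketch could look like the following. The helper name `time_command` is an assumption for illustration. It measures wall-clock time with `time.perf_counter`, which includes the cost of spawning a shell from Python, so the numbers will not match bash's builtin `time` exactly.

```python
import subprocess
import time


def time_command(cmd, runs=3):
    """Run shell command `cmd` `runs` times; return wall-clock seconds per run.

    Hypothetical helper for rough comparisons only.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the command's output so terminal I/O does not skew the timing.
        subprocess.run(cmd, shell=True, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return timings
```

Calling `time_command("tail -n+50000000 test.in | head -n10")` would then give three timings comparable to one row of the table, assuming test.in was generated as described above.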