I'm wondering what the most efficient method is to read data from a locally hosted file using Python.

Either using a subprocess to just cat the contents of the file:

import subprocess

# Spawn cat and iterate over its stdout line by line
ssh = subprocess.Popen(['cat', dir_to_file],
                       stdout=subprocess.PIPE)
for line in ssh.stdout:
    print line

Or simply reading the contents of the file:

f = open(dir_to_file)
data = f.readlines()
f.close()
for line in data:
    print line

I am creating a script that has to read the contents of many files, and I would like to know which method is more efficient in terms of CPU usage and which is faster in terms of runtime.

This is my first post here at Stack Overflow; apologies for the formatting.

Thanks

Prash
  • In case you want the file line by line, you don't need to open and then readlines; you can iterate directly with `for line in open(dir_to_file)` (see the sketch after these comments). – heltonbiker Apr 27 '16 at 02:28
  • My guess is if CPU usage is your concern, then it matters more what you are doing line by line than what is doing the reading. You are going to be IO bottlenecked by the hard drive before the CPU gives out on these examples. – chrisd1100 Apr 27 '16 at 02:35
  • You've asked a question only you can answer. Implement both and measure them. No one else can tell you which is better for *your* data on *your* computer. Having said that, I claim that the 2nd is almost guaranteed to be faster. – Robᵩ Apr 27 '16 at 02:40
  • @heltonbiker I need to read all lines – Prash Apr 27 '16 at 02:42
  • @Robᵩ I will give it a try, but since I am going to be reading hundreds of files in a matter of seconds, I think the 1st is less load on the CPU. I do notice CPU usage spiking to ~100% when running the script using the 2nd method (file read) – Prash Apr 27 '16 at 02:46
  • 1- define *"read data"*: do you want to decode bytes into Unicode text? do you want to read line by line? See [`read_read()`, `read_readtxt()`, `read_readlines()`](http://stackoverflow.com/a/13861768/4279) 2- if your task is I/O bound, it doesn't matter how *"efficient in terms of CPU usage"* it is: your program will wait for the disk anyway (drop the file cache and rerun your benchmark, e.g., [the same command may be x10 times slower if run with cold file cache](http://stackoverflow.com/questions/25606833/fastest-way-to-sum-integers-in-text-file/25607155#comment40064167_25606833)). – jfs Apr 27 '16 at 15:14
  • 3- the performance may depend on the physical order files are stored on disk, see [Python slow read performance issue](https://stackoverflow.com/q/26178038/4279) – jfs Apr 27 '16 at 15:14
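
A minimal sketch of the direct iteration heltonbiker mentions above, assuming the same `dir_to_file` path as in the question; the file object itself is iterable, so lines are read lazily rather than all at once:

# Iterate over the file object directly; lines are read on demand,
# so the whole file is never held in memory at once.
with open(dir_to_file) as f:  # dir_to_file as in the question
    for line in f:
        print line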

1 Answer

@chrisd1100 is correct that printing line by line is the bottleneck. After a quick experiment, here is what I found.

I ran and timed the two methods above repeatedly (A - subprocess, B - readline) on two different file sizes (~100KB and ~10MB).

Trial 1: ~100KB

subprocess: 0.05 - 0.1 seconds
readline:   0.02 - 0.026 seconds

Trial 2: ~10MB

subprocess: ~7 seconds
readline:   ~7 seconds

At the larger file size, printing line by line becomes by far the most expensive operation. At the smaller file size, readline is roughly twice as fast. Tentatively, I'd say that readline is faster.

These were all run on Python 2.7.10, OS X 10.11.13, 2.8 GHz i7.
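
For anyone who wants to reproduce this, a rough sketch of the kind of timing loop involved (not the exact harness behind the numbers above; the file path and repeat count are placeholders):

import subprocess
import time

def time_it(func, repeats=10):
    # Average wall-clock time over several runs to smooth out noise.
    start = time.time()
    for _ in range(repeats):
        func()
    return (time.time() - start) / repeats

def via_subprocess(path):
    proc = subprocess.Popen(['cat', path], stdout=subprocess.PIPE)
    for line in proc.stdout:
        pass  # stand-in for the per-line work (e.g. print line)
    proc.wait()

def via_readlines(path):
    f = open(path)
    for line in f.readlines():
        pass  # stand-in for the per-line work (e.g. print line)
    f.close()

path = 'sample.txt'  # placeholder file
print 'subprocess:', time_it(lambda: via_subprocess(path))
print 'readlines: ', time_it(lambda: via_readlines(path))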

Phillip Martin
  • Great analysis; it looks like readline is the way to go for smaller files, and for larger files the runtime difference is insignificant. But I did notice that CPU usage seems to spike to >90% when using readline to read many files. – Prash Apr 27 '16 at 03:26
  • I don't currently know of a way to profile the two programs' CPU usage accurately, although that does seem like an interesting thing to look into and I would have loved to include it :) But if your observation of 90% is compelling enough for you, then go ahead and use `subprocess`. Since you'll be working on this, if you find anything new, please come back and update this thread. I'd be interested to know. – Phillip Martin Apr 27 '16 at 12:17
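
As a rough way to compare CPU time (as opposed to wall time), following up on the last comment, `os.times()` can be sampled before and after the read; it also counts time spent in child processes such as `cat`. A sketch, with the file path as a placeholder:

import os

def cpu_and_wall(func):
    # os.times() returns (user, system, children_user, children_system, elapsed);
    # the children fields matter when the reading happens in a subprocess like cat.
    before = os.times()
    func()
    after = os.times()
    cpu = ((after[0] - before[0]) + (after[1] - before[1]) +
           (after[2] - before[2]) + (after[3] - before[3]))
    wall = after[4] - before[4]
    return cpu, wall

def read_all(path):
    f = open(path)
    for line in f.readlines():
        pass  # stand-in for the per-line work
    f.close()

cpu, wall = cpu_and_wall(lambda: read_all('sample.txt'))  # placeholder file
print 'CPU: %.3fs  wall: %.3fs' % (cpu, wall)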