Did I/O become slower since Python 2.7?

Question

I'm working on a small side project in which I want to sort a 20 GB file on my machine as fast as possible. The idea is to chunk the file, sort the chunks, and merge the chunks. I used pyenv to time the radixsort code with different Python versions and saw that 2.7.18 is much faster than 3.6.10, 3.7.7, 3.8.3 and 3.9.0a. Can anybody explain why Python 3.x is slower than 2.7.18 in this simple example? Were new features added that could explain it?

import os


def chunk_data(filepath, prefixes):
    """
    Pre-sort and chunk the content of filepath according to the prefixes.

    Parameters
    ----------
    filepath : str
        Path to a text file which should get sorted. Each line contains
        a string which has at least 2 characters and the first two
        characters are guaranteed to be in prefixes
    prefixes : List[str]
    """
    prefix2file = {}
    for prefix in prefixes:
        chunk = os.path.abspath("radixsort_tmp/{:}.txt".format(prefix))
        prefix2file[prefix] = open(chunk, "w")

    # This is where most of the execution time is spent:
    with open(filepath) as fp:
        for line in fp:
            prefix2file[line[:2]].write(line)
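
For reference, the whole chunk / sort / merge idea can be sketched on a tiny input as follows. This is only a minimal Python 3 illustration, not the benchmarked code: the function name radix_chunk_sort and the parameters outpath and tmpdir are made up for this example, and it assumes that every line ends with a newline and that each chunk fits into memory.

```python
import os
import tempfile
from contextlib import ExitStack


def radix_chunk_sort(filepath, prefixes, outpath, tmpdir):
    """Bucket lines by 2-character prefix, sort each bucket, concatenate."""
    chunk_paths = {p: os.path.join(tmpdir, "{}.txt".format(p)) for p in prefixes}

    # Chunk: ExitStack closes every chunk file when the block exits.
    with ExitStack() as stack:
        prefix2file = {
            p: stack.enter_context(open(path, "w"))
            for p, path in chunk_paths.items()
        }
        with open(filepath) as fp:
            for line in fp:
                prefix2file[line[:2]].write(line)

    # Sort each chunk in memory and append the chunks in prefix order.
    with open(outpath, "w") as fout:
        for prefix in sorted(prefixes):
            with open(chunk_paths[prefix]) as fchunk:
                fout.writelines(sorted(fchunk))


# Tiny demonstration in a throwaway directory.
lines = ["ba2\n", "ab\n", "bb\n", "aa9\n", "ab0\n", "aa1\n"]
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "w") as f:
        f.writelines(lines)
    radix_chunk_sort(src, ["aa", "ab", "ba", "bb"], dst, tmp)
    with open(dst) as f:
        result = f.readlines()

print(result == sorted(lines))
```

Because all lines in one chunk share the same two leading characters, concatenating the sorted chunks in prefix order gives the same result as sorting the whole file at once.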

Execution times (multiple runs):

2.7.18: 192.2s, 220.3s, 225.8s
3.6.10: 302.5s
3.7.7: 308.5s
3.8.3: 279.8s, 279.7s (binary mode), 295.3s (binary mode), 307.7s, 380.6s (wtf?)
3.9.0a: 292.6s

The complete code is on GitHub, along with a minimal complete version.

Unicode

Yes, I know that Python 3 and Python 2 deal with strings differently. I tried opening the files in binary mode (rb / wb), see the "binary mode" comments. They are a tiny bit faster on a couple of runs. Still, Python 2.7 is WAY faster on all runs.

Try 1: Dictionary access

When I phrased this question, I thought that dictionary access might be a reason for this difference. However, I think the total execution time is way less for dictionary access than for I/O. Also, timeit did not show anything important:

import timeit
import numpy as np

durations = timeit.repeat(
    'a["b"]',
    repeat=10 ** 6,
    number=1,
    setup="a = {'b': 3, 'c': 4, 'd': 5}"
)

mul = 10 ** -7

print(
    "mean = {:0.1f} * 10^-7, std={:0.1f} * 10^-7".format(
        np.mean(durations) / mul,
        np.std(durations) / mul
    )
)
print("min  = {:0.1f} * 10^-7".format(np.min(durations) / mul))
print("max  = {:0.1f} * 10^-7".format(np.max(durations) / mul))

Try 2: Copy time

As a simplified experiment, I tried to copy the 20GB file:

cp via shell: 230s
Python 2.7.18: 237s, 249s
Python 3.8.3: 233s, 267s, 272s

My first thought was that the variance is quite high, so this could be the reason. But the variance of the chunk_data execution time is also high, and its mean is still noticeably lower for Python 2.7 than for Python 3.x. So it seems not to be an I/O scenario as simple as the one I tried here.

The Python timings were produced by the following code:

import time
import sys
import os


version = sys.version_info
version = "{}.{}.{}".format(version.major, version.minor, version.micro)


# Remove the output of a previous run (same file name as written below).
if os.path.isfile("numbers-tmp.txt"):
    os.remove("numbers-tmp.txt")

t0 = time.time()
with open("numbers-large.txt") as fin, open("numbers-tmp.txt", "w") as fout:
    for line in fin:
        fout.write(line)
t1 = time.time()


print("Python {}: {:0.0f}s".format(version, t1 - t0))

My System

Ubuntu 20.04
Thinkpad T460p
Python through pyenv

What is the timing if you open the file in binary mode in python 3? `open(filepath, mode='rb')` — MaxNoe, May 29 '20 at 06:25
@MaxNoe I've also tried that with Python 3.8 (not the other Python versions). Pretty much the same time — Martin Thoma, May 29 '20 at 06:52
How exactly did you benchmark this code? Which operating system are you using? Could you provide a script which only covers data generation and the part in question? Looking at your Github repo it didn't became immediately obvious how to run only these parts. — a_guest, May 29 '20 at 07:24
@a_guest https://github.com/MartinThoma/algorithms/tree/master/sorting/so-62079732 is a stripped-down version. I'l add it in the post — Martin Thoma, May 29 '20 at 15:27

score 11 · Accepted Answer · answered Jun 03 '20 at 11:06

This is a combination of multiple effects. The main ones are that Python 3 has to decode and encode Unicode when working in text mode, and that in binary mode it sends the data through dedicated buffered I/O implementations.

First of all, time.time measures wall time and hence includes all sorts of things unrelated to Python, such as OS-level caching and buffering, as well as buffering of the storage medium. It also reflects any interference from other processes that use the storage medium. That's why you are seeing these wild variations in timing results. Here are the results for my system, from seven consecutive runs for each version:

py3 = [660.9, 659.9, 644.5, 639.5, 752.4, 648.7, 626.6]  # 661.79 +/- 38.58
py2 = [635.3, 623.4, 612.4, 589.6, 633.1, 613.7, 603.4]  # 615.84 +/- 15.09

Despite the large variation, these results do seem to indicate different timings, as can be confirmed, for example, by a statistical test:

>>> from scipy.stats import ttest_ind
>>> ttest_ind(py2, py3)[1]
0.018729004515179636

In other words, if both sets of timings came from the same distribution, a difference at least this large would show up in only about 2% of cases.
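
The summary figures in the comments and the size of the gap can be reproduced with the standard library alone. The sketch below assumes that the "+/-" values are population standard deviations (NumPy's default), which matches the numbers, and it computes the pooled-variance t statistic that scipy's ttest_ind uses by default. The helper names summary and pooled_t are made up for this example; the p-value itself is not computed here because it needs the t distribution.

```python
import math
import statistics

py3 = [660.9, 659.9, 644.5, 639.5, 752.4, 648.7, 626.6]
py2 = [635.3, 623.4, 612.4, 589.6, 633.1, 613.7, 603.4]


def summary(xs):
    """Mean and population standard deviation of a list of timings."""
    return statistics.mean(xs), statistics.pstdev(xs)


def pooled_t(a, b):
    """Student's t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb)
    )


py3_mean, py3_std = summary(py3)
py2_mean, py2_std = summary(py2)
t_stat = pooled_t(py3, py2)

print("py3: {:.2f} +/- {:.2f}".format(py3_mean, py3_std))
print("py2: {:.2f} +/- {:.2f}".format(py2_mean, py2_std))
print("t = {:.2f}, degrees of freedom = {}".format(t_stat, len(py3) + len(py2) - 2))
```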

We can get a more precise picture by measuring the process time rather than the wall time. In Python 2 this can be done via time.clock while Python 3.3+ offers time.process_time. These two functions report the following timings:

py3_process_time = [224.4, 226.2, 224.0, 226.0, 226.2, 223.7, 223.8]  # 224.90 +/- 1.09
py2_process_time = [171.0, 171.1, 171.2, 171.3, 170.9, 171.2, 171.4]  # 171.16 +/- 0.16

Now there's much less spread in the data since the timings reflect the Python process only.
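
The difference between the two clocks can be seen in a small Python 3 sketch. The helper name measure is made up for this example, and time.sleep merely stands in for waiting on a slow disk: it costs wall time but almost no CPU time.

```python
import time


def measure(func, *args, **kwargs):
    """Run func once and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


# Sleeping is counted by the wall clock but not by the process clock.
wall, cpu = measure(time.sleep, 0.2)
print("wall = {:.3f}s, process = {:.3f}s".format(wall, cpu))
```

Wrapping the copy loop of the test script in such a helper would give both numbers from a single run.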

This data suggests that Python 3 takes about 53.7 seconds longer to execute. Given the large number of lines in the input file (550_000_000), this amounts to about 97.7 nanoseconds per iteration.
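
The conversion from total seconds to nanoseconds per line is simple arithmetic. The sketch below uses the mean process times quoted above; the helper name ns_per_iteration is made up for this example.

```python
N_LINES = 550_000_000  # number of lines in the input file

py3_mean = 224.90  # seconds, mean process time under Python 3
py2_mean = 171.16  # seconds, mean process time under Python 2


def ns_per_iteration(seconds, iterations=N_LINES):
    """Convert a total time in seconds to nanoseconds per loop iteration."""
    return seconds / iterations * 1e9


extra_seconds = py3_mean - py2_mean
extra_ns = ns_per_iteration(extra_seconds)
print("{:.2f} s extra, {:.1f} ns per line".format(extra_seconds, extra_ns))
```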

The first effect that increases execution time is the use of Unicode strings in Python 3. The binary data is read from the file, decoded, and then encoded again when it is written back. In Python 2 all strings are stored as byte strings right away, so there is no encoding/decoding overhead. You don't see this effect clearly in your tests because it disappears in the large variation introduced by various external resources, which is reflected in the wall time. For example, we can measure the time it takes for a roundtrip from binary to unicode to binary:

In [1]: %timeit b'000000000000000000000000000000000000'.decode().encode()                     
162 ns ± 2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

This does include two attribute lookups as well as two function calls, so the actual time needed is smaller than the value reported above. To see the effect on execution time, we can change the test script to use binary modes "rb" and "wb" instead of text modes "r" and "w". This reduces the timing results for Python 3 as follows:

py3_binary_mode = [200.6, 203.0, 207.2]  # 203.60 +/- 2.73

That reduces the process time by about 21.3 seconds or 38.7 nanoseconds per iteration. This is in agreement with timing results for the roundtrip benchmark minus timing results for name lookups and function calls:

In [2]: class C: 
   ...:     def f(self): pass 
   ...:                                                                                       

In [3]: x = C()                                                                               

In [4]: %timeit x.f()                                                                         
82.2 ns ± 0.882 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [5]: %timeit x                                                                             
17.8 ns ± 0.0564 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

Here %timeit x measures the additional overhead of resolving the global name x, and hence the attribute lookup and function call make up 82.2 - 17.8 == 64.4 nanoseconds. Subtracting this overhead twice from the above roundtrip data gives 162 - 2*64.4 == 33.2 nanoseconds.
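
Outside of IPython, roughly the same measurements can be taken with the timeit module. This is a sketch for Python 3.5+ and not the exact benchmark from above: the payload is an arbitrary 36-byte ASCII string held in a global name instead of a literal, so absolute numbers will differ somewhat and will vary from machine to machine.

```python
import timeit

N = 200_000
payload = b"0" * 36  # arbitrary short ASCII payload


class C:
    def f(self):
        pass


x = C()

# Each call returns the total time for N executions; divide to get per-loop time.
roundtrip = timeit.timeit("payload.decode().encode()", globals=globals(), number=N) / N
call = timeit.timeit("x.f()", globals=globals(), number=N) / N
name = timeit.timeit("x", globals=globals(), number=N) / N

overhead = call - name  # one attribute lookup plus one call
print("roundtrip:               {:.1f} ns".format(roundtrip * 1e9))
print("lookup + call:           {:.1f} ns".format(overhead * 1e9))
print("roundtrip - 2 * overhead: {:.1f} ns".format((roundtrip - 2 * overhead) * 1e9))
```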

Now there's still a difference of 32.4 seconds between Python 3 using binary mode and Python 2. This comes from the fact that all the I/O in Python 3 goes through the (quite complex) implementation of io.BufferedWriter.write, while in Python 2 the file.write method proceeds fairly directly to fwrite.
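
In Python 3 the layering of the I/O stack can be inspected directly through the buffer and raw attributes of the file objects. The following sketch (the temporary file name is arbitrary) shows that text mode puts an encoding layer on top of the same buffered writer that binary mode uses:

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "layers.txt")

    with open(path, "w") as text_file:
        # Text mode: str is encoded, then buffered, then written to the raw file.
        text_layers = (
            type(text_file),
            type(text_file.buffer),
            type(text_file.buffer.raw),
        )

    with open(path, "wb") as binary_file:
        # Binary mode skips the text wrapper but keeps the buffer.
        binary_layers = (type(binary_file), type(binary_file.raw))

print(text_layers)
print(binary_layers)
```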

We can check the types of the file objects in both implementations:

$ python3.8
>>> type(open('/tmp/test', 'wb'))
<class '_io.BufferedWriter'>

$ python2.7
>>> type(open('/tmp/test', 'wb'))
<type 'file'>

Here we also need to note that the above timing results for Python 2 have been obtained by using text mode, not binary mode. Binary mode aims to support all objects implementing the buffer protocol which results in additional work being performed also for strings (see also this question). If we switch to binary mode also for Python 2 then we obtain:

py2_binary_mode = [212.9, 213.9, 214.3]  # 213.70 +/- 0.59

which is actually a bit slower than the Python 3 binary-mode results (by about 10.1 seconds, or 18.4 ns per iteration).

The two implementations also differ in other details such as the dict implementation. To measure this effect we can create a corresponding setup:

from __future__ import print_function

import timeit

N = 10**6
R = 7
results = timeit.repeat(
    "d[b'10'].write",
    setup="d = dict.fromkeys((str(i).encode() for i in range(10, 100)), open('test', 'rb'))",  # requires file 'test' to exist
    repeat=R, number=N
)
results = [x/N for x in results]
print(['{:.3e}'.format(x) for x in results])
print(sum(results) / R)

This gives the following results for Python 2 and Python 3:

Python 2: ~ 56.9 nanoseconds
Python 3: ~ 78.1 nanoseconds

This additional difference of about 21.2 nanoseconds amounts to about 12 seconds for the full 550M iterations.

The above timing code checks the dict lookup for only one key, so we also need to verify that there are no hash collisions:

$ python3.8 -c "print(len({str(i).encode() for i in range(10, 100)}))"
90
$ python2.7 -c "print len({str(i).encode() for i in range(10, 100)})"
90
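
As a cross-check, the per-lookup difference quoted earlier converts back to total seconds as follows. This is plain arithmetic on the numbers from this answer, nothing is measured here.

```python
N_LINES = 550_000_000  # number of lines in the input file

py2_lookup_ns = 56.9  # dict lookup + attribute access, Python 2
py3_lookup_ns = 78.1  # dict lookup + attribute access, Python 3

lookup_extra_ns = py3_lookup_ns - py2_lookup_ns
lookup_extra_seconds = lookup_extra_ns * 1e-9 * N_LINES
print("{:.1f} ns per line, {:.1f} s in total".format(lookup_extra_ns, lookup_extra_seconds))
```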

Wow, thank you so much for putting that much effort in it (+1)! I need some time to read it (after work), but it seems quite likely right now that I will then accept it :-) — Martin Thoma, Jun 03 '20 at 12:15
Whoa. Really, really nice question _and_ answer. I don't know when I will need this, but I'm sure that I'll be using this as reference somewhen in the future. — orithena, Jun 03 '20 at 16:49
