Short answer: Floating point division is cheaper at the hardware level than integer division. And on at least one common architecture, floating point division can be vectorized while integer division cannot, so the most expensive operation in your code is performed more times, and at a higher cost per operation, for integer math, while floating point math does it fewer times, at a lower cost per operation.
Long answer: `numpy` uses vectorized math when available, and the x86-64 architecture (which I'm guessing you're using) doesn't provide a SIMD instruction for integer division. It only provides vectorized multiplication for integers (via the `PMULUDQ` family of instructions), but provides both multiplication (`MULPD` family) and division (`DIVPD` family) for floating point.
When you use `/` for true divide, the result type is `float64`, not `int64`, and `numpy` can perform the operation with a single packed load and convert (the `VCVTQQ2PD` family of instructions), followed by a packed divide, followed by a packed move back to memory (`MOVAPD` family).
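The type promotion itself is easy to observe (a minimal sketch; the array contents are arbitrary):

```python
import numpy as np

a = np.arange(1, 9, dtype=np.int64)   # int64 operands
b = np.full(8, 2, dtype=np.int64)

true_div = a / b     # true division: result promoted to float64
floor_div = a // b   # floor division: result stays int64

print(true_div.dtype)   # float64
print(floor_div.dtype)  # int64
```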
On the most modern x86-64 chips with AVX512 (Xeon Phi x200+ and Skylake-X and higher, the latter available to the desktop market since late 2017), each such vectorized instruction can perform eight operations at once (older post-2011 architectures can do four with AVX, and before that you could do two with SSE2). For `/`, that means you only need to issue two `VCVTQQ2PD`s (one for each source array), one `VDIVPD` and one `VMOVAPD` (all `EVEX` prefixed for 512 bit operation) for every eight divisions to be performed. By contrast, for `//` to perform the same eight divisions it needs to issue eight `MOV`s from memory (to load the left array operands), eight `CQO`s (to sign extend the left array operands to the 128 bit values `IDIV` needs), eight `IDIV`s (which at least load from the right side array for you), and eight `MOV`s back to memory.
I don't know if `numpy` takes full advantage of this (my own copy is clearly compiled for the SSE2 baseline all x86-64 machines provide, so it only does two divisions at once, not eight), but it's possible, whereas there is no way to vectorize the equivalent integer operations at all.
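If you want to see the gap on your own hardware, a quick `timeit` comparison works (a sketch only; the array size and iteration count are arbitrary, and the absolute numbers are entirely machine- and build-dependent):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(1, 1000, size=100_000).astype(np.int64)
b = rng.integers(1, 1000, size=100_000).astype(np.int64)

# Time true division (float64 result) vs. floor division (int64 result)
t_true = timeit.timeit(lambda: a / b, number=200)
t_floor = timeit.timeit(lambda: a // b, number=200)

print(f"true  division: {t_true:.4f} s total")
print(f"floor division: {t_floor:.4f} s total")
```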
While the individual instructions for the integer case are often a little cheaper, they're basically always more expensive than the combined equivalent instructions. And integer division is actually worse for a single operation than packed floating point division is for eight: per Agner Fog's Skylake-X table, the cost per `IDIV` is 24-90 cycles, with a latency of 42-95; the cost of `VDIVPD` with all 512 bit registers is 16 cycles, with a latency of 24 cycles. `VDIVPD` isn't just doing eight times the work, it's doing it in (at most) two thirds the cycles required by `IDIV` (I don't know why `IDIV` has such a wide range for cycle count, but `VDIVPD` beats even the best number for `IDIV`). For plain AVX operations (only four divides per `VDIVPD`), the cycle cost per instruction is cut in half (to eight), while plain `DIVPD` at two divides per instruction is only four cycles, so the division itself is basically the same speed per element regardless of whether you use SSE2, AVX or AVX512 instructions (AVX512 just saves you a little on latency and load/store overhead). Even if vectorized instructions were never used, plain `FDIV` is only a 4-5 cycle instruction (binary floating point division happens to be easier than integer division in general, go figure), so you'd expect floating point math to do pretty well.
Point is, at a hardware level, dividing a lot of 64 bit floating point values is cheaper than dividing a lot of 64 bit integer values, so true division using `/` is inherently faster than floor division using `//`.
On my own machine (which I've verified uses only baseline SSE2 `DIVPD`, so it only does two divisions per instruction), I attempted to replicate your results, and my timings were a little less divergent. True division took 485 μs per operation, while floor division took 1.05 ms per op; floor division only took slightly more than 2x longer, where it was almost 3x longer for you. At a guess, your copy of `numpy` was compiled with support for AVX or AVX512, and you're squeezing a bit more performance out of true division as a result.
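You can check what your own copy was built with; `np.show_config()` prints the build configuration, and newer `numpy` releases include the SIMD extensions that were baked in and detected (what it prints varies by version, so treat this as a starting point):

```python
import numpy as np

# Prints compiler and build information; recent numpy versions also
# report which SIMD extensions were compiled in and found at runtime.
np.show_config()
```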
As for why non-`numpy` Python `int` floor division takes longer than true division, it's a similar reason, but there are a few complicating factors:
- (Helps `int / int`, hurts `int // int`) The same hardware instruction overhead issues from `numpy` apply; `IDIV` is slower than `FDIV`.
- (Hides performance differences) The interpreter overhead of performing individual divisions is a greater percentage of the total time, which shrinks the relative performance difference, since more of the time is spent on fixed overhead.
- (Increases cost of integer operations in general) Integer division is even more expensive on Python `int`s; on CPython, `int` is implemented as an array of 15 or 30 bit limbs, with a `ssize_t` field that defines both the sign and the number of limbs. CPython's `float` is just a plain object wrapper around a normal C `double` with no special features.
- (Increases the cost of `int // int`) Python guarantees floor division, but C only provides truncating integer division, so even in the fast path for small `int`s, Python has to check for mismatched signs and adjust the operation to ensure the result is the floor, not simple truncation.
- (Increases cost of `int / int` specifically for large inputs) CPython's `int / int` operation doesn't just cast both operands to `float` (C `double`) and perform floating point division. It tries to do so when the operands are small enough, but if they're too large, it falls back to a complicated algorithm to achieve the best possible correctness.
- (Reduces cost of repeated `int / int` if the result is discarded in short order) Since Python `float`s are fixed size, CPython implements a minor performance optimization for them using a free list; when you create and delete `float`s repeatedly, the newly created ones don't need to go to the memory allocator, they just pull from the free list, and release back to it when the reference count drops to zero. CPython's `int`s are variable length, and no free list is used.
On balance, this is all slightly in favor of `int / int` (at least for small `int`s; the large `int` case gets more complex, but then it's probably worse for `int // int` too, since array-based division algorithms are insanely complicated/expensive), so seeing similar behavior with the Python built-in types is not unexpected.
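Two of the `int`-specific points above are easy to see from the interpreter (a sketch; the particular values are arbitrary):

```python
import math

# Python's // floors toward negative infinity; C's integer division
# truncates toward zero, so CPython must adjust when the signs differ.
print(-7 // 2)             # -4 (floored)
print(math.trunc(-7 / 2))  # -3 (what plain truncating division would give)

# CPython's int / int does not simply convert both sides to float:
# converting 10**400 to float would overflow, yet the division is fine
# thanks to the large-operand fallback path.
big = 10 ** 400
print(big / (big // 10))   # 10.0
try:
    float(big)
except OverflowError:
    print("float(10**400) overflows, but the division above worked")
```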