Short answer: Floating point division is cheaper at the hardware level than integer division. And on at least one common architecture, floating point division can be vectorized while integer division cannot, so the most expensive operation in your code is performed more times, and at a higher cost per operation, for integer math, while floating point math does it fewer times, at a lower cost per operation.
Long answer: `numpy` uses vectorized math when available, and the x86-64 architecture (which I'm guessing you're using) doesn't provide a SIMD instruction for integer division. It only provides vectorized multiplication for integers (via the `PMULUDQ` family of instructions), but provides both multiplication (`MULPD` family) and division (`DIVPD` family) for floating point.
When you use `/` for true divide, the result type is `float64`, not `int64`, and `numpy` can perform the operation with a single packed load and convert (the `VCVTQQ2PD` family of instructions), followed by a packed divide, followed by a packed move back to memory (`MOVAPD` family).
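The type promotion itself is easy to observe (a minimal sketch; the array contents are arbitrary):

```python
import numpy as np

a = np.arange(1, 9, dtype=np.int64)   # int64 operands
b = np.full(8, 2, dtype=np.int64)

true_div = a / b     # true division: result promoted to float64
floor_div = a // b   # floor division: result stays int64

print(true_div.dtype)   # float64
print(floor_div.dtype)  # int64
```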
On the most modern x86-64 chips with AVX512 (Xeon Phi x200+ and Skylake-X and higher, the latter available to the desktop market since late 2017), each such vectorized instruction can perform eight operations at once (older post-2011 architectures can do four with AVX, and before that you could do two with SSE2). For `/`, that means you only need to issue two `VCVTQQ2PD`s (one for each source array), one `VDIVPD` and one `VMOVAPD` (all `EVEX` prefixed for 512 bit operation) for every eight divisions to be performed. By contrast, for `//` to perform the same eight divisions it needs to issue eight `MOV`s from memory (to load the left array operands), eight `CQO`s (to sign extend the left array operands to the 128 bit values `IDIV` needs), eight `IDIV`s (which at least load from the right side array for you), and eight `MOV`s back to memory.
I don't know if `numpy` takes full advantage of this (my own copy is clearly compiled for the SSE2 baseline all x86-64 machines provide, so it only does two divisions at once, not eight), but it's possible, whereas there is no way to vectorize the equivalent integer operations at all.
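If you want to see the gap on your own hardware, a quick `timeit` comparison works (a sketch only; the array size and iteration count are arbitrary, and the absolute numbers are entirely machine- and build-dependent):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(1, 1000, size=100_000).astype(np.int64)
b = rng.integers(1, 1000, size=100_000).astype(np.int64)

# Time true division (float64 result) vs. floor division (int64 result)
t_true = timeit.timeit(lambda: a / b, number=200)
t_floor = timeit.timeit(lambda: a // b, number=200)

print(f"true  division: {t_true:.4f} s total")
print(f"floor division: {t_floor:.4f} s total")
```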
While the individual instructions for the integer case are often a little cheaper, they're basically always more expensive than the combined equivalent instructions. And integer division is actually worse for a single operation than packed floating point division is for eight: per Agner Fog's Skylake-X table, the cost per `IDIV` is 24-90 cycles, with a latency of 42-95; the cost of `VDIVPD` with all 512 bit registers is 16 cycles, with a latency of 24 cycles. `VDIVPD` isn't just doing eight times the work, it's doing it in (at most) two thirds the cycles required by `IDIV` (I don't know why `IDIV` has such a wide range for cycle count, but `VDIVPD` beats even the best number for `IDIV`). For plain AVX operations (only four divides per `VDIVPD`), the cycle cost per instruction is cut in half (to eight), while plain `DIVPD` at two divides per instruction is only four cycles, so the division itself is basically the same speed per element regardless of whether you use SSE2, AVX or AVX512 instructions (AVX512 just saves you a little on latency and load/store overhead). Even if vectorized instructions were never used, plain `FDIV` is only a 4-5 cycle instruction (binary floating point division happens to be easier than integer division in general, go figure), so you'd expect floating point math to do pretty well.
Point is, at a hardware level, dividing a lot of 64 bit floating point values is cheaper than dividing a lot of 64 bit integer values, so true division using `/` is inherently faster than floor division using `//`.
On my own machine (which I've verified uses only baseline SSE2 `DIVPD`, so it only does two divisions per instruction), I attempted to replicate your results, and my timings were a little less divergent. True division took 485 μs per operation, while floor division took 1.05 ms per op; floor division only took slightly more than 2x longer, where it was almost 3x longer for you. At a guess, your copy of `numpy` was compiled with support for AVX or AVX512, and you're squeezing a bit more performance out of true division as a result.
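You can check what your own copy was built with; `np.show_config()` prints the build configuration, and newer `numpy` releases include the SIMD extensions that were baked in and detected (what it prints varies by version, so treat this as a starting point):

```python
import numpy as np

# Prints compiler and build information; recent numpy versions also
# report which SIMD extensions were compiled in and found at runtime.
np.show_config()
```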
As for why non-`numpy` Python `int` floor division takes longer than true division, it's a similar reason, but there are a few complicating factors:
- (Helps `int / int`, hurts `int // int`) The same hardware instruction overhead issues from `numpy` apply; `IDIV` is slower than `FDIV`.
- (Hides performance differences) The interpreter overhead of performing individual divisions is a greater percentage of the total time, which shrinks the relative performance difference, since more of the time is spent on fixed overhead.
- (Increases cost of integer operations in general) Integer division is even more expensive on Python `int`s; on CPython, `int` is implemented as an array of 15 or 30 bit limbs, with a `ssize_t` field that defines both the sign and the number of limbs. CPython's `float` is just a plain object wrapper around a normal C `double` with no special features.
- (Increases the cost of `int // int`) Python guarantees floor division, but C only provides truncating integer division, so even in the fast path for small `int`s, Python has to check for mismatched signs and adjust the operation to ensure the result is the floor, not simple truncation.
- (Increases cost of `int / int` specifically for large inputs) CPython's `int / int` operation doesn't just cast both operands to `float` (C `double`) and perform floating point division. It tries to do so when the operands are small enough, but if they're too large, it falls back to a complicated algorithm to achieve the best possible correctness.
- (Reduces cost of repeated `int / int` if the result is discarded in short order) Since Python `float`s are fixed size, CPython implements a minor performance optimization for them using a free list; when you create and delete `float`s repeatedly, the newly created ones don't need to go to the memory allocator, they just pull from the free list, and release back to it when the reference count drops to zero. CPython's `int`s are variable length, and no free list is used.
On balance, this is all slightly in favor of `int / int` (at least for small `int`s; the large `int` case gets more complex, but then it's probably worse for `int // int` too, since array-based division algorithms are insanely complicated/expensive), so seeing similar behavior with the Python built-in types is not unexpected.
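Two of the `int`-specific points above are easy to see from the interpreter (a sketch; the particular values are arbitrary):

```python
import math

# Python's // floors toward negative infinity; C's integer division
# truncates toward zero, so CPython must adjust when the signs differ.
print(-7 // 2)             # -4 (floored)
print(math.trunc(-7 / 2))  # -3 (what plain truncating division would give)

# CPython's int / int does not simply convert both sides to float:
# converting 10**400 to float would overflow, yet the division is fine
# thanks to the large-operand fallback path.
big = 10 ** 400
print(big / (big // 10))   # 10.0
try:
    float(big)
except OverflowError:
    print("float(10**400) overflows, but the division above worked")
```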