
Can a scaled 64-bit/32-bit division performed by the hardware 128-bit/64-bit division instruction, such as:

; Entry arguments: Dividend in EAX, Divisor in EBX
shl rax, 32  ;Scale up the Dividend by 2^32
xor rdx,rdx
and rbx, 0xFFFFFFFF  ;Clear any garbage that might have been in the upper half of RBX
div rbx  ; RAX = RDX:RAX / RBX

...be faster in some special cases than the scaled 64-bit/32-bit division performed by the hardware 64-bit/32-bit division instruction, such as:

; Entry arguments: Dividend in EAX, Divisor in EBX
mov edx,eax  ;Scale up the Dividend by 2^32
xor eax,eax
div ebx  ; EAX = EDX:EAX / EBX

By "some special cases" I mean unusual dividends and divisors. I am interested in comparing the div instruction only.

George Robinson
  • That's a good question. What makes you suspect that this is the case? – fuz Jun 18 '19 at 18:59
  • I suspect this because I assume that the C compiler authors are very smart and so far I have failed to make the popular C compilers generate the latter code when dividing an unsigned 32-bit integer (shifted left 32 bits) by another 32-bit integer. It always compiles to the 128-bit/64-bit `div` instruction. P.S. The left shift compiles fine to `shl` – George Robinson Jun 18 '19 at 19:04
  • By the way, instead of `and rbx, 0xFFFFFFFF` (which is not possible, but I get what you mean) you can write `mov ebx, ebx` – harold Jun 18 '19 at 19:48
  • @harold: Really?! `mov ebx, ebx` will clear the upper 32 bits of `rbx` ? – George Robinson Jun 18 '19 at 20:05
  • @GeorgeRobinson: Every instruction that modifies a 32-bit register (e.g. `ebx`) causes the higher 32-bits of a 64-bit register to be zeroed. Note that (because of this) it's likely that the higher 32-bits in `ebx` are already zeroed from earlier (not shown) instructions. – Brendan Jun 18 '19 at 20:33
  • Is your question about a specific microarchitecture (e.g., Skylake)? Or do you want to know whether there is a microarchitecture on which the first division can be faster? Or do you want to know whether it can be faster on all x86-64 microarchitectures? – Andreas Abel Jun 18 '19 at 20:33
  • @Andreas: On any x86-64 microarchitecture. – George Robinson Jun 18 '19 at 20:51
  • Updated my answer now that I had more time: compilers will never do that optimization because they won't be able to prove it won't fault. – Peter Cordes Jun 18 '19 at 23:25

2 Answers


You're asking about optimizing uint64_t / uint64_t C division down to a 64b / 32b => 32b x86 asm division when the divisor is known to be 32-bit. The compiler must of course avoid the possibility of a #DE exception on a perfectly valid (in C) 64-bit division; otherwise it wouldn't be following the as-if rule. So it can only do this if it's provable that the quotient will fit in 32 bits.

Yes, that's a win, or at least break-even. On some CPUs it's even worth checking for the possibility at runtime because 64-bit division is so much slower. But unfortunately current x86 compilers don't have an optimizer pass to look for this optimization even when you do manage to give them enough info that they could prove it's safe. For example, `if (edx >= ebx) __builtin_unreachable();` didn't help the last time I tried.
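For concreteness, that hint can be written as a small GNU C sketch (the function name is my own, not from the thread). Per the paragraph above, current gcc/clang still emit the 128/64-bit `div` for it:

```c
#include <stdint.h>

/* Hypothetical sketch: promise the compiler the quotient fits in 32 bits,
 * hoping it will pick a 64b/32b => 32b div. As noted above, current
 * gcc/clang don't look for this and still emit the 64-bit div. */
uint32_t div_hinted(uint64_t dividend, uint32_t divisor)
{
    if ((uint32_t)(dividend >> 32) >= divisor)
        __builtin_unreachable();   /* promise: div r32 could not fault */
    return (uint32_t)(dividend / divisor);
}
```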


For the same inputs, 32-bit operand-size will always be at least as fast

16-bit or 8-bit division could maybe be slower than 32-bit because they may have a false dependency when writing their output, but writing a 32-bit register zero-extends to 64-bit to avoid that. (That's why `mov ecx, ebx` is a good way to zero-extend the value in `ebx` into a 64-bit register, better than `and` with a mask that's not encodeable as a 32-bit sign-extended immediate, as harold pointed out.) But other than partial-register shenanigans, 16-bit and 8-bit division are generally also as fast as 32-bit, or at least not worse.

On AMD CPUs, division performance doesn't depend on operand-size, just on the data. `0 / 1` with 128/64-bit should be faster than the worst case of any smaller operand-size. AMD's integer-division instruction is only 2 uops (presumably because it has to write 2 registers), with all the logic done in the execution unit.

16-bit / 8-bit => 8-bit division on Ryzen is a single uop (because it only has to write AH:AL = AX).


On Intel CPUs, div/idiv is microcoded as many uops. It's about the same number of uops for all operand-sizes up to 32-bit (Skylake = 10), but 64-bit is much, much slower. (Skylake `div r64` is 36 uops; Skylake `idiv r64` is 57 uops.) See Agner Fog's instruction tables: https://agner.org/optimize/

div/idiv throughput for operand-sizes up to 32-bit is fixed at 1 per 6 cycles on Skylake. But div/idiv r64 throughput is one per 24-90 cycles.

See also Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux for a specific performance experiment where modifying the REX.W prefix in an existing binary to change div r64 into div r32 made a factor of ~3 difference in throughput.

And Why does Clang do this optimization trick only from Sandy Bridge onward? shows clang opportunistically using 32-bit division when the dividend is small, when tuning for Intel CPUs. But you have a large dividend and a large-enough divisor, which is a more complex case. That clang optimization is still zeroing the upper half of the dividend in asm, never using a non-zero or non-sign-extended EDX.
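The idea behind that opportunistic check can be sketched in portable C (a hypothetical helper of my own, not clang's actual output):

```c
#include <stdint.h>

/* Sketch of the idea behind clang's trick: when both operands fit in
 * 32 bits, a 32-bit div can't fault on quotient overflow and is much
 * cheaper on Intel CPUs than div r64. (Hypothetical helper name.) */
uint64_t div_opportunistic(uint64_t n, uint64_t d)
{
    if (((n | d) >> 32) == 0)
        return (uint32_t)n / (uint32_t)d;   /* 32-bit division */
    return n / d;                           /* full 64-bit division */
}
```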


I have failed to make the popular C compilers generate the latter code when dividing an unsigned 32-bit integer (shifted left 32 bits) by another 32-bit integer.

I'm assuming you cast that 32-bit integer to uint64_t first, to avoid UB and get a normal uint64_t / uint64_t in the C abstract machine.

That makes sense: your way wouldn't be safe; it will fault with #DE when `edx >= ebx`. x86 division faults when the quotient overflows AL / AX / EAX / RAX, instead of silently truncating. There's no way to disable that.
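The exact no-fault condition is easy to state: `div r32` computing EDX:EAX / r32 is safe precisely when the high half of the dividend is strictly less than the divisor. A small sketch of that predicate (hypothetical name):

```c
#include <stdint.h>

/* div r32 faults (#DE) iff divisor == 0 or the quotient doesn't fit
 * in EAX.  quotient < 2^32  <=>  dividend < divisor * 2^32
 *                           <=>  (dividend >> 32) < divisor          */
int div32_is_safe(uint64_t dividend, uint32_t divisor)
{
    return divisor != 0 && (uint32_t)(dividend >> 32) < divisor;
}
```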

So compilers normally only use `idiv` after `cdq` or `cqo`, and `div` only after zeroing the high half, unless you use an intrinsic or inline asm to open yourself up to the possibility of your code faulting. In C, `x / y` only faults if `y = 0` (or for signed, `INT_MIN / -1` is also allowed to fault¹).

GNU C doesn't have an intrinsic for wide division, but MSVC has _udiv64. (With gcc/clang, division wider than 1 register uses a helper function which does try to optimize for small inputs. But this doesn't help for 64/32 division on a 64-bit machine, where GCC and clang just use the 128/64-bit division instruction.)
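In GNU C, the only way to get a 64/32 => 32-bit `div` directly is inline asm. A hedged sketch of an `_udiv64` analogue (my own helper, x86-only with a portable fallback); the caller must guarantee `(dividend >> 32) < divisor`, otherwise the hardware `div` raises #DE:

```c
#include <stdint.h>

/* Hypothetical GNU C analogue of MSVC's _udiv64: 64b/32b => 32b division
 * using the hardware div r32.  Caller must ensure (dividend >> 32) < divisor,
 * otherwise the div instruction raises #DE. */
static inline uint32_t udiv64_32(uint64_t dividend, uint32_t divisor)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t quot, rem;
    __asm__("divl %[div]"
            : "=a"(quot), "=d"(rem)
            : "0"((uint32_t)dividend),          /* EAX = low half  */
              "1"((uint32_t)(dividend >> 32)),  /* EDX = high half */
              [div] "r"(divisor)
            : "cc");
    (void)rem;
    return quot;
#else
    return (uint32_t)(dividend / divisor);  /* portable fallback */
#endif
}
```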

Even if there were some way to promise the compiler that your divisor would be big enough to make the quotient fit in 32 bits, current gcc and clang don't look for that optimization in my experience. It would be a useful optimization for your case (if it's always safe), but compilers won't look for it.


Footnote 1: To be more specific, ISO C describes those cases as "undefined behaviour"; some ISAs like ARM have non-faulting division instructions. C UB means anything can happen, including just truncation to 0 or some other integer result. See Why does integer division by -1 (negative one) result in FPE? for an example of AArch64 vs. x86 code-gen and results. Allowed to fault doesn't mean required to fault.

Peter Cordes
  • I think you meant "mov ecx, ebx is a good way to zero-extend ecx to 64-bit,"...you wrote "...a good way to zero-extend ebx to 64-bit, – George Robinson Jun 19 '19 at 10:51
  • I know that `div ebx` will overflow and fault when `ebx<=edx` but such faulting is acceptable to me because it is an indicator, that the input arguments to my function (the dividend and divisor) are wrong. In my function the divisor is always greater than the dividend and I wish the C compiler would account for that when choosing the size of the `div` instruction. – George Robinson Jun 19 '19 at 10:55
  • @GeorgeRobinson: I meant "a good way to zero-extend *the value in* EBX into a 64-bit register". I picked a different destination (RCX) so mov-elimination can work. [Can x86's MOV really be "free"? Why can't I reproduce this at all?](//stackoverflow.com/q/44169342) – Peter Cordes Jun 19 '19 at 15:24
  • @GeorgeRobinson: If you used `if(b <= d) __builtin_unreachable();` then you'd see that gcc has a missed optimization: it doesn't look for the opportunity to use 32-bit `div` when the C operands are 64-bit. You'd need to use inline asm because I don't think gcc has a builtin for 64/32-bit can-fault-on-overflow division. – Peter Cordes Jun 19 '19 at 15:30
  • Few points of note: Shifting left 32 is UB if the operand is not greater than 32bit. If a compiler is nice it may automatically promote that in which case it's correct in treating it as if it was 64bit. Regardless this should leave the lower bits of the number as zeros. This would then result in `0 / a` assuming truncation to 32bit again. `((UINT32)((UINT64)b<<32)) / a` is not the same as `((UINT64)b<<32) / a` – Mgetz Jun 20 '19 at 16:01
  • @Mgetz: Were you replying to the OP? I was assuming the OP was using a C expression that resulted in the desired `uint64_t / uint64_t` after integer promotions. The question is about hoping the compiler will optimize that to a 64b / 32b => 32b division when the divisor is known to be 32-bit, and it's provable that the quotient will fit in 32 bits. Which is a win (and will produce its 32-bit result zero-extended to 64-bit, just like C needs to implement 64/64 => 64-bit division). On some CPUs even worth checking for at runtime because 64-bit division is so much slower. – Peter Cordes Jun 20 '19 at 18:25
  • @PeterCordes sort of; the OP posted a [follow up question](https://stackoverflow.com/q/56657236/332733) and my comment was basically to provide a bit of context for those using your answer coming from there. – Mgetz Jun 20 '19 at 18:33
  • @Mgetz: After my last comment, I realized this question could use a bit of context, so I added a summary section at the top of my answer about the C optimization they're hoping for. – Peter Cordes Jun 20 '19 at 18:41

Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs?

In theory, anything is possible (e.g. maybe in 50 years time Nvidia creates an 80x86 CPU that ...).

However, I can't think of a single plausible reason why a 128bit/64bit division would ever be faster than (not merely equivalent to) a 64bit/32bit division on x86-64.

I suspect this because I assume that the C compiler authors are very smart and so far I have failed to make the popular C compilers generate the latter code when dividing an unsigned 32-bit integer (shifted left 32 bits) by another 32-bit integer. It always compiles to the 128-bit/64-bit div instruction. P.S. The left shift compiles fine to shl.

Compiler developers are smart, but compilers are complex and the C language rules get in the way. For example, if you just write `a = b / c;` (with `b` being 64-bit and `c` being 32-bit), the language's rules say that `c` gets converted to 64-bit before the division happens. So it ends up as a 64-bit divisor in some kind of intermediate language, which makes it hard for the back end (translating from intermediate language to assembly language) to tell that the 64-bit divisor could have been a 32-bit divisor.
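That conversion fits in a two-line example (actual code-gen varies by compiler and tuning, but per this thread gcc and clang both emit the 64-bit `div` here):

```c
#include <stdint.h>

/* c is converted to uint64_t by the usual arithmetic conversions before
 * the division, so by the time the back end sees it, this is a 64/64
 * division -- and compilers emit the slow 64-bit div. */
uint64_t quotient(uint64_t b, uint32_t c)
{
    return b / c;
}
```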

Brendan