This all Just Works if you compare the FP bit-patterns as sign/magnitude integers, including `-0 < +0` and the NaN bit-patterns (footnote 1). This is one reason why IEEE formats like binary64 (`double`) use a biased exponent and put the fields in that order. (Another being convenient implementation of `nextafter` by `++` or `--` on the bit pattern.)
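As an aside, a minimal sketch of that `nextafter`-by-increment idea for a positive, finite `double` could look like this (untested; the helper name is mine, not anything standard):

```c
#include <stdint.h>
#include <string.h>

/* Untested sketch: for a positive, finite double, incrementing the raw bit
 * pattern gives the next representable value toward +infinity.  The carry
 * propagates from the mantissa into the exponent at format boundaries,
 * and DBL_MAX increments to +infinity. */
static double next_up_positive(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   // type-pun without violating strict aliasing
    bits += 1;
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```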
The comparison itself can be implemented efficiently in terms of a 2's complement integer compare:
- if the signs are both cleared: non-negative numbers Just Work
- if only one has its sign bit set: any negative is less than any non-negative, including `-0.0 < +0.0` as `0x80000000 < 0x00000000`, so 2's complement `x <= y` Just Works.
- if both have their sign bit set (`(x&y)>>63`): 2's complement `x<y` is sign/magnitude FP `x>y`. In x86 asm you may avoid the shift and just look at SF (e.g. after a `test`), or use the high bit of a SIMD element. (See the C sketch after this list.)
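Putting those cases together, a minimal scalar C version (untested sketch; the function name is mine) could look like this, using `memcpy` for the type-pun:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* Untested sketch of the case analysis above: totalOrder-style x <= y for
 * doubles, comparing the bit patterns as signed 2's complement integers and
 * reversing the result when both sign bits are set. */
static bool totalorder_le(double x, double y)
{
    int64_t ix, iy;
    memcpy(&ix, &x, sizeof ix);
    memcpy(&iy, &y, sizeof iy);
    if ((ix & iy) < 0)       // both sign bits set: sign/magnitude order is reversed
        return iy <= ix;
    else                     // at least one non-negative: 2's complement order matches
        return ix <= iy;
}
```

The bitwise-equal case falls out naturally here because `<=` is true in both branches when the bit patterns are identical.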
Handling the both-negative case branchlessly without messing up the `==` case is tricky: you can't just XOR the `x&y` sign into a `<` result; that would flip it when they compared equal. (e.g. with `x == y == -1.0`, the raw `<` result is 0, and both sign bits are set, so the XOR flips it to 1; with `x == y == +1.0` it stays 0.) It would give you `<=` when both inputs are negative but `<` for other cases. I'm not sure if that's usable for sorting.
With SSE4.2 `pcmpgtq` you can operate on `double` FP values in their normal XMM registers, or with SSE2 (guaranteed for x86-64) `pcmpgtd` for 32-bit `float`. (Note that `pcmpgtq` is relatively slow compared to `pcmpgtd`: fewer ports and higher latency. https://agner.org/optimize/. e.g. on Skylake, 1 p5 uop with 3c latency, vs. `pcmpgtd` and `pcmpeqq` being 1 uop for p0/p1 with 1 cycle latency.)
We can't handle the bitwise-equal case using just one `pcmpgtq` + sign fixups. `x1 bitwise_eq x0` gives a `pcmpgtq` result of 0 regardless of whether the inputs are positive or negative. Flipping that based on `sign(x0&x1)` would give inconsistent behaviour whichever of `>`, `>=`, `<` or `<=` (according to the total order) you want 0 or 1 to mean. But unfortunately the `-0.0 == +0.0` behaviour of FP comparisons means we have to special-case on FP-equal, not just FP-unordered.
You don't need assembly: just type-pun to `uint64_t` in C, for example, to get the compiler to (probably) use `movq rax, xmm0`, or use intrinsics for vector regs.
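For the packed 32-bit `float` case mentioned above, a per-lane version using only SSE2 intrinsics might look like this (an untested sketch of the same sign-flip + bitwise-equal fixup idea used in the asm below; the function name is mine):

```c
#include <emmintrin.h>   // SSE2

/* Untested sketch: per-lane totalOrder-style x <= y for four packed floats,
 * using only SSE2 (pcmpgtd / pcmpeqd).  Only the sign bit of each result lane
 * is meaningful (fine for movmskps or a blendvps-style select). */
static __m128 totalorder_le_ps(__m128 x, __m128 y)
{
    __m128i ix = _mm_castps_si128(x);
    __m128i iy = _mm_castps_si128(y);
    __m128i gt = _mm_cmpgt_epi32(iy, ix);        // y > x as signed 32-bit ints
    __m128i both_neg = _mm_and_si128(ix, iy);    // sign bit set only where both are negative
    gt = _mm_xor_si128(gt, both_neg);            // flip the sign bit in those lanes
    __m128i eq = _mm_cmpeq_epi32(ix, iy);        // bitwise-equal lanes are always "<="
    return _mm_castsi128_ps(_mm_or_si128(gt, eq));
}
```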
But if you are using asm, you could consider doing an FP compare first and branching on ZF=1, which will be set for unordered or equal, and only then doing the integer compare. If you expect NaN and exact equality (including `+-0.0 == -+0.0`) to be rare, this could work well. Notice that ZF,CF,PF = 1,1,1 for unordered in the `ucomisd` docs. All x86 FP compares set flags the same way, either directly or via `fcom` / `fnstsw ax` / `sahf`.
For example, a stand-alone version could look like this. (Simplify when inlining, e.g. branch directly with `jb` instead of `setb` if the caller branches):
totalOrder: ; 0/1 integer in EAX = (xmm0 <= xmm1 totalOrder)
xor eax, eax
ucomisd xmm0, xmm1 ; ZF=0 implies PF=0 (ordered) so just check ZF
jz .compare_as_integer ; unordered or FP-equal
; else CF accurately reflects the < or > (total) order of xmm0 vs. xmm1
setb al ; or branch with jb
ret
;; SSE4.2, using AVX 3-operand versions. Use movaps as needed for non-AVX
;;; Untested
; Used for unordered or FP-equal, including -0.0 == +0.0
; but also including -1.0 == -1.0 for example
.compare_as_integer: ; should work in general for any sign/magnitude integer
vpcmpgtq xmm2, xmm1, xmm0 ; reversed order of comparison: x1>x0 == x0<x1
vpand xmm3, xmm1, xmm0 ; we only care about the MSB of the 64-bit integer
vpxor xmm2, xmm2, xmm3 ; flip if x0 & x1 are negative
vpcmpeqq xmm1, xmm1, xmm0
vpor xmm2, xmm2, xmm1
; top bits of XMM2 hold the boolean result for each SIMD element
; suitable for use with blendvpd
vmovmskpd eax, xmm2 ; low bit of EAX = valid, high bit might be garbage
and eax, 1 ; optional depending on use-case
; EAX=1 if x0 bitwise_eq x1 or sign/magnitude x1 > x0
ret
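In C with intrinsics, the same integer-compare fallback for a pair of scalar doubles might look like this (untested sketch mirroring the asm above; the function name is mine):

```c
#include <immintrin.h>   // needs SSE4.2 (-msse4.2 or -mavx) for _mm_cmpgt_epi64

/* Untested sketch: returns 1 if x0 is bitwise-equal to x1 or sign/magnitude
 * x1 > x0, matching the EAX result of the asm block above. */
static int compare_as_integer(double x0, double x1)
{
    __m128i ix0 = _mm_castpd_si128(_mm_set_sd(x0));     // bit pattern in element 0
    __m128i ix1 = _mm_castpd_si128(_mm_set_sd(x1));
    __m128i gt  = _mm_cmpgt_epi64(ix1, ix0);            // pcmpgtq: x1 > x0 as signed 64-bit
    __m128i both_neg = _mm_and_si128(ix1, ix0);         // we only care about the MSB
    gt = _mm_xor_si128(gt, both_neg);                   // flip if both inputs are negative
    __m128i eq  = _mm_cmpeq_epi64(ix1, ix0);            // bitwise-equal special case
    __m128i res = _mm_or_si128(gt, eq);
    return _mm_movemask_pd(_mm_castsi128_pd(res)) & 1;  // sign bit of the low element
}
```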
With AVX512VL, `vpternlogq` can implement any arbitrary boolean function of 3 inputs, so one of them can fold the XOR and OR into a single instruction, computing `(y_gt_x ^ (x&y)) | y_eq_x` with the `x&y` AND result as one of its three inputs.
Without SSE4.2, or just as a scalar branchless strategy, I came up with this. (e.g. if values were actually in memory, you could just do `mov` loads instead of `movq` from XMM regs).
;; works on its own, or as the fallback after ucomisd/jz
compare_as_integer:
movq rcx, xmm0
movq rsi, xmm1
xor eax, eax
cmp rcx, rsi
; je bitwise equal special case would simplify the rest
setl al ; 2's complement x < y
sete dl
and rcx, rsi ; maybe something with TEST / CMOVS?
shr rcx, 63
xor al, cl ; flip the SETL result if both inputs were negative
or al, dl ; always true on bitwise equal
ret
The xor-zeroing of EAX should make it safe to read EAX without a partial-register stall even on P6-family, after writing AL with `setl` and the 8-bit `xor` and `or`. (Why doesn't GCC use partial registers?) On most other CPUs, the only downside here is a false dependency on the old value of RDX, which I didn't break before `sete dl`. If I had xor-zeroed EDX first (breaking that dependency), the `xor` and `or` into EAX would still work the same way.
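For reference, the same branchless idea in C (an untested sketch; a compiler won't necessarily produce exactly the asm above):

```c
#include <stdint.h>
#include <string.h>

/* Untested sketch of the branchless setl/sete + and/shr/xor/or sequence:
 * returns 1 for bitwise-equal or sign/magnitude x < y. */
static int compare_as_integer_branchless(double x, double y)
{
    int64_t ix, iy;
    memcpy(&ix, &x, sizeof ix);
    memcpy(&iy, &y, sizeof iy);
    int lt = ix < iy;                                  // 2's complement compare (setl)
    int eq = ix == iy;                                 // bitwise equal (sete)
    int both_neg = (int)((uint64_t)(ix & iy) >> 63);   // and / shr 63
    return (lt ^ both_neg) | eq;                       // flip if both negative; equal is always true
}
```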
A branchy strategy could work like this:
;; probably slower unless data is predictable, e.g. mostly non-negative
compare_as_integer_branchy:
movq rcx, xmm0
movq rsi, xmm1
xor eax, eax ; mov eax,1 with je to a ret wouldn't avoid partial-register stalls for setl al
cmp rcx, rsi
je .flip_result ; return 1
setl al ; 2's complement x < y
test rcx, rsi
js .flip_result ; if (x&y both negative)
ret
.flip_result: ; bitwise equal (0 -> 1), or both inputs negative (flip the setl result)
xor al, 1
ret
Mix and match parts of this if you want; the AND/SHR/XOR could be used along the non-equal path instead of `test`+`js`.
If inlining this in a case where you branch on the result, you could put the common(?)-case branch (finite and not equal) ahead of the special-case handling. But then the fall-through path also includes the ordered `<` case, not just the rare stuff, so a hopefully-predictable branch on ZF=1 (which includes the PF=1 unordered case) might still be a good idea:
ucomisd xmm1, xmm0
ja x1_gt_x0 ; CF==0 && ZF==0
; maybe unordered, maybe -0 vs +0, maybe just x1 < x0
Footnote 1: NaN encodings as part of the total order
FP values (and their sign/magnitude encodings) are symmetric around zero. The sign bit is always a sign bit, even for NaNs, and can thus be handled the same way.
- The smallest magnitudes are +-0.0 of course: all exponent and mantissa bits zero.
- The subnormals have a zero exponent field (minimum value) implying a leading zero for the mantissa. The explicit part is non-zero. Magnitude is linear with mantissa. (Zero is actually just a special case of a subnormal.)
- The normalized numbers span exponent = 1 to exponent < max, implying a leading 1 in the mantissa. The highest value within one exponent (all mantissa bits set) is just below the value with exponent+1 and mantissa = 0: i.e. incrementing the bit pattern by 1, with carry from the mantissa into the exponent, gives the next representable float/double.
- +- Infinity has exponent = all-ones, mantissa = 0
- +- NaN has exponent = all-ones, mantissa = non-zero
- on x86, sNaN has the highest bit of the mantissa cleared. Rest is payload with at least 1 set bit anywhere (else it's an Inf).
- on x86, qNaN has the highest bit of the mantissa set. Rest is payload. (The small demo after this list prints a few of these encodings.)
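A tiny demo of that layout (my own, untested; it assumes a C11 `<float.h>` for `DBL_TRUE_MIN` and that the `NAN` macro gives the usual x86 positive qNaN), printing a few values whose bit patterns increase along with the values themselves:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <float.h>
#include <math.h>

/* Untested demo: increasing non-negative doubles have increasing bit patterns
 * when viewed as integers, with Inf above all finite values and (on x86) the
 * default qNaN above Inf. */
int main(void)
{
    double vals[] = { 0.0, DBL_TRUE_MIN, DBL_MIN, 1.0, DBL_MAX, INFINITY, NAN };
    for (size_t i = 0; i < sizeof vals / sizeof vals[0]; i++) {
        uint64_t bits;
        memcpy(&bits, &vals[i], sizeof bits);
        printf("%-12g 0x%016llx\n", vals[i], (unsigned long long)bits);
    }
    return 0;
}
```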
https://cwiki.apache.org/confluence/display/stdcxx/FloatingPoint (linked from Are the bit patterns of NaNs really hardware-dependent?) shows some sNaN and qNaN encodings on a couple of other ISAs. Some differ from x86, but POWER and Alpha have the MSB of the mantissa set for qNaN, so their qNaNs have larger integer magnitude than any sNaN.
PA-RISC chose the other way around, so implementing a total-order predicate on that (obsolete) ISA would need extra work for the FP-compare unordered case; maybe flipping that bit in both values, if either of them is any kind of NaN, before proceeding with the integer compare would work.
(I mention this because the same algorithm could be used in higher level languages that might not be exclusively used on x86. But you might want to just leave it and always handle the same binary bit-patterns the same way, even if that means qNaN < sNaN on some platforms. You only even get sNaN in the first place by manually writing the bit-pattern.)
PS: I know "significand" is more technically correct, but "mantissa" has fewer syllables and I like it better, and is well understood in this context.