
The IEEE 754 specification defines a total order in §5.10, which I want to implement in assembly.

From the Wikipedia description, it sounds like this can be implemented branch-free, or almost branch-free, but I haven't been able to come up with a decent approach, and I couldn't find any existing spec-compliant implementation in major programming languages.

When comparing two floating-point numbers, it acts as the ≤ operation, except that totalOrder(−0, +0) ∧ ¬ totalOrder(+0, −0), and different representations of the same floating-point number are ordered by their exponent multiplied by the sign bit. The ordering is then extended to the NaNs by ordering −qNaN < −sNaN < numbers < +sNaN < +qNaN, with ordering between two NaNs in the same class being based on the integer payload, multiplied by the sign bit, of those data.

Does it make sense to first check for NaNs and then either jump to a floating-point comparison or handle the NaN case, or does it make more sense to move the floating-point values to integer registers and do all the operations there?

(At least from reading the description, it feels like the spec authors made an effort to allow a straightforward implementation with integer instructions.)

What's the "best" way to implement a total order for floating points on x86-64 processors?

  • I think you can just handle the sign bit and then do signed-integer comparison. (Comparing FP bit-patterns as integers Just Works because of the exponent bias; that's why it's done that way. Except they're sign/magnitude integers vs. x86-64 hardware implementing 2's complement.) NaN has all-ones exponent and non-zero significand so is already farther from `0` than finite or +-Inf. Also related: [SIMD instructions for floating point equality comparison (with NaN == NaN)](//stackoverflow.com/q/34951714) – Peter Cordes Dec 15 '19 at 21:59
  • e.g. with [SSE4.2 `pcmpgtq`](https://www.felixcloutier.com/x86/pcmpgtq) you can operate on `double` FP values in their normal XMM registers, or SSE2 (guaranteed for x86-64) [`pcmpgtd`](https://www.felixcloutier.com/x86/pcmpgtb:pcmpgtw:pcmpgtd) for 32-bit `float`. (BTW, you don't need assembly, just type-pun to `uint64_t` to get the compiler to probably use `movq rax, xmm0`, or use intrinsics for vector regs.) – Peter Cordes Dec 15 '19 at 22:03
  • The "multiple representations of the same finite value" could apply to 80-bit Extended precision "pseudo-denormal" with the *explicit* leading 1 bit of the mantissa cleared https://en.wikipedia.org/wiki/Extended_precision#x86_extended_precision_format; an x87 FPU will never generate numbers like that but apparently do handle them. (float and double store that bit implicitly, implied by `exponent_encoding == 0`). BTW, can you link the wikipedia article you're quoting? – Peter Cordes Dec 15 '19 at 22:05
  • @PeterCordes: It also applies to the decimal interchange formats. The paragraph shown applies to both binary and decimal formats, but I'm assuming that the OP is only interested in the binary format (and probably only specifically interested in either the binary64 or binary32 interchange formats). – Mark Dickinson Dec 15 '19 at 22:22
  • @PeterCordes as far as I understand the total order would place "negative" NaNs as the smallest values and "positive" NaNs as the largest values, not sure whether this would be handled correctly ... – soc Dec 15 '19 at 22:57

1 Answer


This all Just Works if you compare the FP bit-patterns as sign/magnitude integers, including -0 < +0 and the NaN bit-patterns¹. This is one reason why IEEE formats like binary64 (double) use a biased exponent and put the fields in that order. (Another being convenient implementation of nextafter by ++ or -- on the bit pattern.)
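To see why the bit patterns order correctly, here's a quick C demonstration (mine, not part of the answer's asm; it just prints bit patterns and shows the ++-is-nextafter property):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// For non-negative finite doubles, the raw bit patterns compare the same
// way as the FP values, and bit-pattern + 1 is the next representable value.
static uint64_t bits_of(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);      // well-defined type-pun
    return u;
}

int main(void) {
    printf("%#llx\n", (unsigned long long)bits_of(1.0));  // 0x3ff0000000000000
    printf("%#llx\n", (unsigned long long)bits_of(2.0));  // 0x4000000000000000
    uint64_t u = bits_of(1.0) + 1;  // increment the bit pattern...
    double next;
    memcpy(&next, &u, sizeof next);
    printf("%a\n", next);           // ...prints 0x1.0000000000001p+0: nextafter(1.0, INF)
    return 0;
}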

That can be implemented efficiently in terms of 2's complement integer compare:

  • if the signs are both cleared: non-negative numbers Just Work
  • if only one has its sign bit set: any negative is less than any non-negative, including -0.0 < +0.0 as 0x80000000 < 0x00000000 so 2's complement x <= y Just Works.
  • if both have their sign bit set ((x&y)>>63): 2's complement x<y is sign/magnitude FP x>y. In x86 asm, you may avoid the shift and just look at SF, or use the high bit of a SIMD element.

    Handling this without messing up the == case is tricky: you can't just XOR x&y sign into a < result; that would flip it when they compared equal. It would give you <= when both inputs are negative but < for other cases. I'm not sure if that's usable for sorting.
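A branchless scalar version of that idea in C (my sketch, not from the answer: map each bit pattern to a key whose unsigned order matches the total order, then compare the keys):

#include <stdint.h>
#include <string.h>

// Map a binary64 bit pattern to a key whose unsigned order matches
// IEEE-754 totalOrder: flip all bits of negative patterns, set the
// sign bit of non-negative ones.  (Covers the NaNs too, per footnote 1.)
static uint64_t total_order_key(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);           // type-pun; compiles to a movq
    uint64_t flip = -(u >> 63);         // all-ones if sign bit set, else 0
    return u ^ (flip | 0x8000000000000000ull);
}

static int total_order_le(double x, double y) {  // totalOrder(x, y)
    return total_order_key(x) <= total_order_key(y);
}

This sidesteps the equal-case problem above because the key transform is a bijection: equal keys mean bitwise-equal inputs, and the unsigned <= handles that for free.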


With SSE4.2 pcmpgtq you can operate on double FP values in their normal XMM registers, or with SSE2 (guaranteed for x86-64) pcmpgtd for 32-bit float. (Note that pcmpgtq is relatively slow compared to pcmpgtd: fewer ports and higher latency (https://agner.org/optimize/). E.g. on Skylake it's 1 p5 uop with 3c latency, vs. pcmpgtd and pcmpeqq being 1 uop for p0/p1 with 1 cycle latency.)

We can't handle the bitwise-equal case using just one pcmpgtq + sign fixups.
x1 bitwise_eq x0 gives a pcmpgtq result of 0 whether the inputs are positive or negative. Flipping it based on sign(x0&x1) would give inconsistent behaviour whichever of >, >=, < or <= you want 0 or 1 to mean, according to the total order. And unfortunately the -0.0 == +0.0 behaviour of FP comparisons means we have to special-case on FP-equal, not just FP-unordered.

You don't need assembly: just type-pun to uint64_t in C, for example, and the compiler will probably use movq rax, xmm0; or use intrinsics for vector regs.

But if you are using asm, you could consider doing an FP compare and branching on ZF=1 (which will be set for unordered or equal), and only then doing the integer comparison. If you expect NaN and exact equality (including +-0.0 == -+0.0) to be rare, this could work well. Notice that ZF,CF,PF = 1,1,1 for unordered in the ucomisd docs. All x86 FP compares set flags the same way, either directly or via fcom/fnstsw ax/lahf.

For example a stand-alone version could look like this. (Simplify when inlining, e.g. branch directly with jb instead of setb if the caller branches):

totalOrder:   ; 0/1 integer in EAX = (xmm0 <= xmm1 totalOrder)
    xor      eax, eax
    ucomisd  xmm0, xmm1           ; ZF=0 implies PF=0 (ordered) so just check ZF
    jz    .compare_as_integer     ; unordered or FP-equal
     ; else CF accurately reflects the < or > (total) order of xmm0 vs. xmm1
    setb     al                 ; or branch with jb
    ret

;; SSE4.2, using AVX 3-operand versions.  Use movaps as needed for non-AVX
;; Untested
        ; Used for unordered or FP-equal, including -0.0 == +0.0
        ; but also including -1.0 == -1.0 for example
 .compare_as_integer:          ; should work in general for any sign/magnitude integer
    vpcmpgtq xmm2, xmm1, xmm0     ; reversed order of comparison: x1>x0 == x0<x1
    vpand    xmm3, xmm1, xmm0     ; we only care about the MSB of the 64-bit integer
    vpxor    xmm2, xmm2, xmm3     ; flip if x0 & x1 are both negative

    vpcmpeqq xmm1, xmm1, xmm0     ; x1 bitwise_eq x0
    vpor     xmm2, xmm2, xmm1     ; force the result true on bitwise equality
       ; top bits of XMM2 hold the boolean result for each SIMD element
       ; suitable for use with blendvpd

    vmovmskpd eax, xmm2           ; low bit of EAX = valid, high bit might be garbage
    and      eax, 1          ; optional depending on use-case
     ; EAX=1 if x0 bitwise_eq x1 or sign/magnitude x1 > x0
    ret

With AVX512VL, vpternlogq can fold most of the AND/XOR/OR chain (y_gt_x) ^ (x&y) | y_eq_x: it implements any arbitrary boolean function of 3 inputs, so one vpternlogq can compute (y_gt_x) ^ (x & y) directly from the compare result and the two original inputs, leaving just the OR with y_eq_x.
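For comparison, the same fallback written with intrinsics might look like this (an untested sketch mirroring the asm above; scalar use, result in the low element):

#include <immintrin.h>

// Sketch of the SSE4.2 fallback: returns 1 if x0 bitwise_eq x1
// or sign/magnitude x1 > x0, matching the asm version.
static int compare_as_integer_sse42(__m128d x0, __m128d x1) {
    __m128i xi = _mm_castpd_si128(x0);
    __m128i yi = _mm_castpd_si128(x1);
    __m128i gt = _mm_cmpgt_epi64(yi, xi);    // pcmpgtq: signed x1 > x0
    __m128i sg = _mm_and_si128(xi, yi);      // only the MSB matters below
    __m128i r  = _mm_xor_si128(gt, sg);      // flip if both inputs negative
    __m128i eq = _mm_cmpeq_epi64(xi, yi);    // force true on bitwise equality
    r = _mm_or_si128(r, eq);
    return _mm_movemask_pd(_mm_castsi128_pd(r)) & 1;  // low element's sign bit
}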

Without SSE4.2, or just as a scalar branchless strategy, I came up with this (e.g. if the values were actually in memory, you could do plain mov loads instead of movq from XMM regs):

;; works on its own, or as the fallback after ucomisd/jz
compare_as_integer:
    movq     rcx, xmm0
    movq     rsi, xmm1

    xor      eax, eax
    cmp      rcx, rsi
   ; a je here for the bitwise-equal special case would simplify the rest
    setl     al                 ; 2's complement x < y
    sete     dl
    and      rcx, rsi           ; maybe something with TEST / CMOVS?
    shr      rcx, 63
    xor      al, cl           ; flip the SETL result if both inputs were negative
    or       al, dl           ; always true on bitwise equal
    ret

The xor-zeroing of EAX should make it safe to read EAX without a partial-reg stall even on P6-family, after writing AL with setl and 8-bit xor and or. (Why doesn't GCC use partial registers?). On most other CPUs, the only downside here is a false dependency on the old value of RDX which I didn't break before sete dl. If I had xor-zeroed EDX first, we could xor and or into EAX.
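For reference, here's the same branchless logic in C (my sketch; a compiler will pick its own setcc/cmov sequence):

#include <stdint.h>
#include <string.h>

// C version of compare_as_integer: 2's complement compare, flipped
// when both sign bits are set, forced true on bitwise equality.
static int compare_as_integer_c(double a, double b) {
    int64_t x, y;
    memcpy(&x, &a, sizeof x);
    memcpy(&y, &b, sizeof y);
    int lt  = x < y;                           // signed 2's complement x < y
    int eq  = x == y;                          // bitwise-equal special case
    int neg = (int)((uint64_t)(x & y) >> 63);  // 1 if both inputs negative
    return (lt ^ neg) | eq;                    // 1 if totalOrder(a, b)
}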

A branchy strategy could work like this:

;; probably slower unless data is predictable, e.g. mostly non-negative
compare_as_integer_branchy:
    movq     rcx, xmm0
    movq     rsi, xmm1

    xor      eax, eax       ; mov eax,1 with je to a ret wouldn't avoid partial-register stalls for setl al
    cmp      rcx, rsi
    je      .flip_result        ; return 1
    setl     al                 ; 2's complement x < y

    test     rcx, rsi
    js     .flip_result         ; if (x&y both negative)
    ret

.flip_result:         ; bitwise equal (AL=0, so this returns 1), or both inputs negative
    xor      al, 1
    ret

Mix and match parts of this if you want; the AND/SHR/XOR could be used along the non-equal path instead of test+js.


If inlining this in a case where you branch on the result, you could put the common(?)-case branch (finite and not equal) ahead of the special-case handling. But the fall-through path then includes ordered <, so a hopefully-predictable branch on ZF=1 (which includes the PF=1 unordered case) might still be a good idea:

    ucomisd  xmm1, xmm0
    ja       x1_gt_x0                ; CF==0 && ZF==0
    ; maybe unordered, maybe -0 vs +0, maybe just x1 < x0

Footnote 1: NaN encodings as part of the total order

FP values (and their sign/magnitude encodings) are symmetric around zero. The sign bit is always a sign bit, even for NaNs, and can thus be handled the same way.

  • The smallest-magnitude values are +-0.0 of course: all exponent and mantissa bits zero.
  • The subnormals have a zero exponent field (minimum value) implying a leading zero for the mantissa. The explicit part is non-zero. Magnitude is linear with mantissa. (Zero is actually just a special case of a subnormal.)
  • The normalized numbers span exponent = 1 to exponent < max, implying a leading 1 in the mantissa. The highest value within one exponent (all mantissa bits set) is just below the value with exponent+1 and mantissa = 0: i.e. incrementing the bit pattern by 1, carrying from mantissa into exponent, gives the next representable float/double.
  • +- Infinity has exponent = all-ones, mantissa = 0
  • +- NaN has exponent = all-ones, mantissa = non-zero
    • on x86, sNaN has the highest bit of the mantissa cleared. Rest is payload with at least 1 set bit anywhere (else it's an Inf).
    • on x86, qNaN has the highest bit of the mantissa set. Rest is payload
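A small helper makes those field boundaries concrete (my illustration; the quiet-bit convention shown is the x86 one described above):

#include <stdint.h>

// Classify a binary64 bit pattern using the fields listed above.
static const char *classify_binary64(uint64_t bits) {
    unsigned exponent = (unsigned)((bits >> 52) & 0x7FF);
    uint64_t mantissa = bits & 0x000FFFFFFFFFFFFFull;
    if (exponent == 0)
        return mantissa ? "subnormal" : "zero";
    if (exponent == 0x7FF) {
        if (mantissa == 0) return "infinity";
        return (mantissa >> 51) ? "qNaN" : "sNaN";  // bit 51 = quiet bit on x86
    }
    return "normal";
}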

https://cwiki.apache.org/confluence/display/stdcxx/FloatingPoint (linked from Are the bit patterns of NaNs really hardware-dependent?) shows some sNaN and qNaN encodings on a couple of other ISAs. Some differ from x86, but POWER and Alpha have the MSB of the mantissa set for qNaN, so their qNaNs have larger integer magnitude than any sNaN.

PA-RISC chose the other way around, so implementing a total order predicate on that (obsolete) ISA would need extra work for the FP-compare unordered case; maybe flipping that bit in both values would work if either of them is any kind of NaN, before proceeding with the integer compare.

(I mention this because the same algorithm could be used in higher level languages that might not be exclusively used on x86. But you might want to just leave it and always handle the same binary bit-patterns the same way, even if that means qNaN < sNaN on some platforms. You only even get sNaN in the first place by manually writing the bit-pattern.)


PS: I know "significand" is more technically correct, but "mantissa" has fewer syllables, I like it better, and it's well understood in this context.

  • Do you know whether it's "better"/"faster" to operate with pcmp... on the floating point registers or move them to GP regs first and then do integer operations on them? – soc Dec 16 '19 at 17:49
  • Ok, so in pseudo-code the logic should look like this, right? https://gist.github.com/soc/455aa2be8250537da3b5b3ac1d13578e – soc Dec 16 '19 at 18:02
  • @soc: if you're using the result e.g. for a blend of SIMD registers, then very likely staying in SIMD is good if you can do it efficiently, especially if you want to actually do SIMD and produce 2 results at once. e.g. for a SIMD sorting network. If you want to branch on the result, you need it in GP regs at some point. It might be worth doing some work there before or in parallel with `movq` of both operands. OTOH, GP-integer compares run on a lot of ports, but CMOV can be clunky to use vs. SIMD compares producing a mask result. especially with AVX1 to avoid movdqa reg. copies – Peter Cordes Dec 16 '19 at 18:31
  • It's a rather simple use-case, I just want to have a method `compareTotal(x: Double, y: Double): Int` that tells me whether the first value is smaller, equal or greater than the second one. – soc Dec 16 '19 at 20:03
  • @soc: You're not even going to inline this into a loop that calls it?? Why are you writing it by hand in asm? Since you need the result as an `Int`, you might consider taking the FP bit patterns as 64-bit `Long` so the caller passes them in integer registers in the first place. Then if the caller is loading them from memory, it can just load straight into RDI and RSI instead of XMM regs. (Or RDX and RCX for Windows x64). – Peter Cordes Dec 16 '19 at 22:32
  • No, that's fine. I just wanted to clarify with my comment that I'm interested in understanding the operation on a single value, not working on multiple values in a vectorized fashion. I'm building a compiler, so asm it is. :-) (Don't worry about "method", I meant it in a self-contained piece of code that has inputs and outputs.) – soc Dec 17 '19 at 01:21
  • @soc: ok, well if you're compiling you can have multiple strategies; one for when the code branches on the total order, another for branchless. A SIMD strategy is also useful to have if you're auto-vectorizing code that uses it. If you'd just said you were building a compiler earlier, that would have been a lot clearer. But anyway, there's no single best strategy. branchy vs. branchless depends on surrounding code and input data; this is why real compilers like GCC/clang need profile-guided optimization to make good choices. Also, with data starting in memory, skipping XMM regs makes sense – Peter Cordes Dec 17 '19 at 21:14
  • @soc: I updated with the start of an idea that uses `ucomisd` / `jz` to leave a `setb` fast-path. I'd guess that would be good if unordered or exact-equality are rare. – Peter Cordes Dec 17 '19 at 21:15
  • @soc: updated with scalar branchless and branchy versions. There might be more room for optimization. – Peter Cordes Dec 17 '19 at 23:01