62

A compare function takes two arguments a and b and returns an integer describing their order. If a is smaller than b, the result is some negative integer. If a is bigger than b, the result is some positive integer. Otherwise, a and b are equal, and the result is zero.

This function is often used to parameterize sorting and searching algorithms from standard libraries.
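
For example, the C standard library's qsort takes exactly such a function, adapted to const void * arguments. A minimal sketch (the straightforward if-chain stands in for whatever implementation you prefer):

#include <stdio.h>
#include <stdlib.h>

/* qsort comparator: receives pointers to the two elements */
static int compare_int_qsort(const void *pa, const void *pb)
{
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
}

int main(void)
{
    int values[] = { 42, -7, 0, 13, -7 };
    qsort(values, 5, sizeof values[0], compare_int_qsort);
    for (int i = 0; i < 5; i++)
        printf("%d ", values[i]);   /* prints: -7 -7 0 13 42 */
    printf("\n");
    return 0;
}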

Implementing the compare function for characters is quite easy; you simply subtract the arguments:

int compare_char(char a, char b)
{
    return a - b;
}

This works because the difference between two characters is generally assumed to fit into an integer. (Note that this assumption does not hold for systems where sizeof(char) == sizeof(int).)

This trick cannot be used to compare integers, because the difference between two integers generally does not fit into an int. For example, INT_MAX - (-1) = INT_MIN suggests that INT_MAX is smaller than -1 (technically, the overflow leads to undefined behavior, but let's assume modulo arithmetic).
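
To make the failure concrete (my illustration; this program relies on the wraparound behavior typical of two's-complement hardware, which standard C does not guarantee):

#include <limits.h>
#include <stdio.h>

int compare_int_naive(int a, int b)
{
    return a - b;   /* overflows for distant operands */
}

int main(void)
{
    /* On typical hardware, INT_MAX - (-1) wraps around to INT_MIN,
       so the result claims that INT_MAX is smaller than -1. */
    printf("%d\n", compare_int_naive(INT_MAX, -1));
    return 0;
}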

So how can we implement the compare function efficiently for integers? Here is my first attempt:

int compare_int(int a, int b)
{
    int temp;
    int result;
    __asm__ __volatile__ (
        "cmp %3, %2 \n\t"     // compare a with b (sets the flags from a - b)
        "mov $0, %1 \n\t"     // result = 0 (mov leaves the flags alone)

        "mov $1, %0 \n\t"
        "cmovg %0, %1 \n\t"   // if (a > b) result = 1

        "mov $-1, %0 \n\t"
        "cmovl %0, %1 \n\t"   // if (a < b) result = -1
    : "=r"(temp), "=r"(result)
    : "r"(a), "r"(b)
    : "cc");
    return result;
}

Can it be done in fewer than 6 instructions? Is there a less straightforward way that is more efficient?

Jonathan Leffler
fredoverflow
  • Have you disassembled `return (a < b) ? -1 : (a > b);`? – jxh Jun 12 '12 at 12:16
  • @user315052 Only five instructions, nice! Now I feel stupid :) – fredoverflow Jun 12 '12 at 12:19
  • On a side note, wouldn't `bool a_less_than_b(int a, int b) { return a < b; }` look better than `int compare_int(int a, int b);`? – iammilind Jun 12 '12 at 12:42
  • Have you tried doing a normal subtraction and testing the carry/underflow flag? – edA-qa mort-ora-y Jun 12 '12 at 13:04
  • This is almost certainly pointless optimization. Any savings here are noise compared to the function call overhead and algorithmic costs in the caller. – Raymond Chen Jun 12 '12 at 13:49
  • @Raymond I'm doing this purely out of curiosity. Ain't assembly fun? ;) – fredoverflow Jun 12 '12 at 14:46
  • Hmmm. I think maybe `int compare_char(char a, char b) {return a-b;}` itself isn't guaranteed portable? I'm thinking the integral promotion from char to int occurs after the subtraction, so if the architecture supports byte-wide subtraction and the compiler uses it (or if the compiler masks off bits above 0xFF in the computation), the results aren't guaranteed to be correct. I think `int compare_char(char a, char b) {return ((int)a)-b;}` would avoid that uncertainty without losing any efficiency in the process. – phonetagger Jun 12 '12 at 15:26
  • @phonetagger Nope, promotion occurs *before* the subtraction. – fredoverflow Jun 12 '12 at 16:02
  • @RaymondChen Inlining … And as for the algorithmic cost, Bentley and McIlroy found comparison cost to be significant enough to merit an analysis in *Engineering a sort function* (but probably due to the indirection via a function pointer and the lack of inlining). – Konrad Rudolph Jun 12 '12 at 16:04
  • @KonradRudolph If this is inlined, then it is actually worse, because the compiler cannot optimize inlined assembly. You're better off expressing it in C so the optimizer can work on the inlined function. For example, your inline assembly prevents the compiler from optimizing `if (compare_int(a,b) < 0)` into `if (a < b)`. – Raymond Chen Jun 12 '12 at 16:10
  • @Raymond I'm forced to agree. – Konrad Rudolph Jun 12 '12 at 16:12
  • @RaymondChen I'm not exactly sure what you mean, but my GNU compiler can inline the `compare_int` function calls just fine. – fredoverflow Jun 12 '12 at 16:12
  • @RaymondChen Is it actually conceivable that a compiler optimizes `if (compare_int(a,b) < 0)` into `if (a < b)` without knowing anything about the "contract" of `compare_int`? I'm not a compiler writer, so I wouldn't know. – fredoverflow Jun 12 '12 at 16:21
  • @KonradRudolph It can inline it, but it probably cannot optimize it. Does it optimize `int result = compare_int(a,b); if (result < 0) {LT} else if (result == 0) {EQ} else {GT}` into `cmp a, b; jl caseLT; jz caseEQ; caseGT:`? Note that with this optimization, the `compare_int` function ends up costing zero instructions! – Raymond Chen Jun 12 '12 at 16:21
  • @FredOverflow The compiler doesn't need to know the contract. It inlines `if (((a < b) ? -1 : (a > b)) < 0)` and then observes that the second operand of the ternary is always negative and the third operand never is. – Raymond Chen Jun 12 '12 at 16:26
  • @FredOverflow - I'm almost certain you're wrong about the promotion always occurring before the subtraction. On most architectures there is no byte-wide subtraction, so the compiler is forced to use the native `int` for the subtraction; thus it promotes before subtracting. This is not a requirement, however. Try it with int64 and int32 in place of int and char; you'll see what I mean. Since the architecture has a 32-bit subtraction, it will use that and get the over/underflow you're talking about, even with a larger return type. – phonetagger Jun 12 '12 at 17:43
  • ...continuing: The only reason your `int compare_char()` works is because the compiler has no byte-wide subtraction instruction, so it's REQUIRED to perform the promotion before the subtraction. So the only way to GUARANTEE the promotion occurs before the subtraction is to explicitly cast one (or both) of the terms in the expression to the larger type beforehand. – phonetagger Jun 12 '12 at 17:45
  • @phonetagger The C++ standard guarantees that the promotion occurs *before* the subtraction. See `5.7 [expr.add] §1` and `5 [expr] §9` and `4.5 [conv.prom]`. – fredoverflow Jun 12 '12 at 18:27
  • @phonetagger The C standard even has an example `char c1, c2; /* ... */ c1 = c1 + c2;` explaining that "the integer promotions **require** that the abstract machine promote the value of each variable to `int` size and then add the two `int`s [...]". – fredoverflow Jun 12 '12 at 18:34
  • Are you sure you can do this kind of simple stuff better than a compiler tuned for years?! – zinking Jun 15 '12 at 03:27
  • Raymond Chen just posted an [interesting writeup](https://blogs.msdn.microsoft.com/oldnewthing/20171117-00/?p=97416) about benchmarking this and what optimizers can do with integer comparisons. – isanae Nov 17 '17 at 19:49

7 Answers

96

This one has no branches and doesn't suffer from overflow or underflow. Each relational operator evaluates to 0 or 1, so the expression yields exactly -1, 0, or 1:

return (a > b) - (a < b);

With gcc -O2 -S, this compiles down to the following six instructions:

xorl    %eax, %eax
cmpl    %esi, %edi
setl    %dl
setg    %al
movzbl  %dl, %edx
subl    %edx, %eax

Here's some code to benchmark various compare implementations:

#include <stdio.h>
#include <stdlib.h>

#define COUNT 1024
#define LOOPS 500
#define COMPARE compare2
#define USE_RAND 1

int arr[COUNT];

int compare1 (int a, int b)
{
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
}

int compare2 (int a, int b)
{
    return (a > b) - (a < b);
}

int compare3 (int a, int b)
{
    return (a < b) ? -1 : (a > b);
}

int compare4 (int a, int b)
{
    __asm__ __volatile__ (
        "sub %1, %0 \n\t"
        "jno 1f \n\t"
        "cmc \n\t"
        "rcr %0 \n\t"
        "1: "
    : "+r"(a)
    : "r"(b)
    : "cc");
    return a;
}

int main ()
{
    for (int i = 0; i < COUNT; i++) {
#if USE_RAND
        arr[i] = rand();
#else
        for (int b = 0; b < sizeof(arr[i]); b++) {
            *((unsigned char *)&arr[i] + b) = rand();
        }
#endif
    }

    int sum = 0;

    for (int l = 0; l < LOOPS; l++) {
        for (int i = 0; i < COUNT; i++) {
            for (int j = 0; j < COUNT; j++) {
                sum += COMPARE(arr[i], arr[j]);
            }
        }
    }

    printf("%d=0\n", sum);

    return 0;
}

The results on my 64-bit system, compiled with gcc -std=c99 -O2, for positive integers (USE_RAND=1):

compare1: 0m1.118s
compare2: 0m0.756s
compare3: 0m1.101s
compare4: 0m0.561s

Of the C-only solutions, the one I suggested was the fastest. user315052's solution was slower despite compiling to only 5 instructions; the slowdown is likely due to the conditional move instruction (cmovge).

Overall, FredOverflow's 4-instruction assembly implementation was the fastest with positive integers. However, this benchmark only exercised integers in the range [0, RAND_MAX], so the 4-instruction test is biased: it handles overflows separately, and these never occur in the test, so the speed may be due to successful branch prediction.

With a full range of integers (USE_RAND=0), the 4-instruction solution is in fact very slow (others are the same):

compare4: 0m1.897s
Ambroz Bizjak
  • Can you modify the benchmark to generate the full spectrum of integers? What happens if you replace `rand()` with `rand() | rand() << 17`? – fredoverflow Jun 12 '12 at 18:45
  • @AmbrozBizjak: Did you verify that your compiler generated the same 5 instructions that was posted in my answer? I'm just curious. – jxh Apr 26 '14 at 02:25
  • @jxh On my current system, I get: compare2 0m0.488s, compare3 0m0.502s. The assembly is not the same as posted but similar: http://ideone.com/uhmIYt and http://ideone.com/cD9Zd2. The problem with these benchmarks is that the results depend also on how the compiler manages to inline the code into the loop. It may be better if we force it to not inline the compare function. – Ambroz Bizjak Apr 27 '14 at 07:53
  • Note that in gcc 7.x (but not 6.x or 8.x), `compare3(...) { a < b ? -1 : (a > b); }` uses a branch - so you could be in for a nasty surprise if your comparison is unpredictable! – BeeOnRope Jan 20 '19 at 18:13
55

The following has always proven to be fairly efficient for me:

return (a < b) ? -1 : (a > b);

With gcc -O2 -S, this compiles down to the following five instructions:

xorl    %edx, %edx
cmpl    %esi, %edi
movl    $-1, %eax
setg    %dl
cmovge  %edx, %eax

As a follow-up to Ambroz Bizjak's excellent companion answer, I was not convinced that his program tested the same assembly code that was posted above. And when I studied the compiler output more closely, I noticed that the compiler was not generating the same instructions as were posted in either of our answers. So, I took his test program, hand-modified the assembly output to match what we posted, and compared the resulting times. It seems the two versions perform roughly identically.

./opt_cmp_branchless: 0m1.070s
./opt_cmp_branch:     0m1.037s

I am posting the assembly of each program in full so that others may attempt the same experiment, and confirm or contradict my observation.

The following is the version with the cmovge instruction ((a < b) ? -1 : (a > b)):

        .file   "cmp.c"
        .text
        .section        .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "%d=0\n"
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:
.LFB20:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        pushq   %rbx
        .cfi_def_cfa_offset 24
        .cfi_offset 3, -24
        movl    $arr.2789, %ebx
        subq    $8, %rsp
        .cfi_def_cfa_offset 32
.L9:
        leaq    4(%rbx), %rbp
.L10:
        call    rand
        movb    %al, (%rbx)
        addq    $1, %rbx
        cmpq    %rbx, %rbp
        jne     .L10
        cmpq    $arr.2789+4096, %rbp
        jne     .L9
        xorl    %r8d, %r8d
        xorl    %esi, %esi
        orl     $-1, %edi
.L12:
        xorl    %ebp, %ebp
        .p2align 4,,10
        .p2align 3
.L18:
        movl    arr.2789(%rbp), %ecx
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L15:
        movl    arr.2789(%rax), %edx
        xorl    %ebx, %ebx
        cmpl    %ecx, %edx
        movl    $-1, %edx
        setg    %bl
        cmovge  %ebx, %edx
        addq    $4, %rax
        addl    %edx, %esi
        cmpq    $4096, %rax
        jne     .L15
        addq    $4, %rbp
        cmpq    $4096, %rbp
        jne     .L18
        addl    $1, %r8d
        cmpl    $500, %r8d
        jne     .L12
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
        addq    $8, %rsp
        .cfi_def_cfa_offset 24
        xorl    %eax, %eax
        popq    %rbx
        .cfi_def_cfa_offset 16
        popq    %rbp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE20:
        .size   main, .-main
        .local  arr.2789
        .comm   arr.2789,4096,32
        .section        .note.GNU-stack,"",@progbits

The version below uses the branchless method ((a > b) - (a < b)):

        .file   "cmp.c"
        .text
        .section        .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "%d=0\n"
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:
.LFB20:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        pushq   %rbx
        .cfi_def_cfa_offset 24
        .cfi_offset 3, -24
        movl    $arr.2789, %ebx
        subq    $8, %rsp
        .cfi_def_cfa_offset 32
.L9:
        leaq    4(%rbx), %rbp
.L10:
        call    rand
        movb    %al, (%rbx)
        addq    $1, %rbx
        cmpq    %rbx, %rbp
        jne     .L10
        cmpq    $arr.2789+4096, %rbp
        jne     .L9
        xorl    %r8d, %r8d
        xorl    %esi, %esi
.L19:
        movl    %ebp, %ebx
        xorl    %edi, %edi
        .p2align 4,,10
        .p2align 3
.L24:
        movl    %ebp, %ecx
        xorl    %eax, %eax
        jmp     .L22
        .p2align 4,,10
        .p2align 3
.L20:
        movl    arr.2789(%rax), %ecx
.L22:
        xorl    %edx, %edx
        cmpl    %ebx, %ecx
        setg    %cl
        setl    %dl
        movzbl  %cl, %ecx
        subl    %ecx, %edx
        addl    %edx, %esi
        addq    $4, %rax
        cmpq    $4096, %rax
        jne     .L20
        addq    $4, %rdi
        cmpq    $4096, %rdi
        je      .L21
        movl    arr.2789(%rdi), %ebx
        jmp     .L24
.L21:
        addl    $1, %r8d
        cmpl    $500, %r8d
        jne     .L19
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
        addq    $8, %rsp
        .cfi_def_cfa_offset 24
        xorl    %eax, %eax
        popq    %rbx
        .cfi_def_cfa_offset 16
        popq    %rbp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE20:
        .size   main, .-main
        .local  arr.2789
        .comm   arr.2789,4096,32
        .section        .note.GNU-stack,"",@progbits
jxh
  • +1; an added benefit is it doesn't rely on the `asm` keyword which, as I'm sure you know, won't work on all platforms. – John Dibling Jun 12 '12 at 12:54
  • This will work only on the i686 and later families of x86 CPUs, i.e. the ones with the CMOV instruction available. Practically, that's any CPU made after the Pentium Pro (except the Pentium MMX), so you should be fine unless one of your users happens to have a >10-year-old machine. – Daniel Kamil Kozar Jun 12 '12 at 13:57
  • @Daniel: if you just use the C version, you should get pretty much optimal code regardless (assuming a half-decent compiler). – Paul R Jun 12 '12 at 17:52
  • Of course, you're right. I was just pointing out the facts coming from this particular code. – Daniel Kamil Kozar Jun 12 '12 at 20:52
  • @FredOverflow: Does the `cmovge` really result in a pipeline stall? There doesn't seem to be a conditional jump in the assembly output you show. What version of the compiler did you use? – jxh Apr 25 '14 at 23:41
16

Okay, I managed to get it down to four instructions :) The basic idea is as follows:

Half the time, the difference is small enough to fit into an int; in that case, just return the difference. Otherwise, shift the difference one bit to the right. The crucial question is which bit to shift into the MSB.

Let's look at two extreme examples, using 8 bits instead of 32 bits for the sake of simplicity:

 10000000 INT_MIN
 01111111 INT_MAX
---------
000000001 difference
 00000000 shifted

 01111111 INT_MAX
 10000000 INT_MIN
---------
111111111 difference
 11111111 shifted

Shifting the carry bit in would yield 0 for the first case (although INT_MIN is not equal to INT_MAX) and some negative number for the second case (although INT_MAX is not smaller than INT_MIN).

But if we flip the carry bit before doing the shift, we get sensible numbers:

 10000000 INT_MIN
 01111111 INT_MAX
---------
000000001 difference
100000001 carry flipped
 10000000 shifted

 01111111 INT_MAX
 10000000 INT_MIN
---------
111111111 difference
011111111 carry flipped
 01111111 shifted

I'm sure there's a deep mathematical reason why it makes sense to flip the carry bit, but I don't see it yet.

int compare_int(int a, int b)
{
    __asm__ __volatile__ (
        "sub %1, %0 \n\t"   // a -= b (sets the overflow and carry flags)
        "jno 1f \n\t"       // no overflow: the difference already has the correct sign
        "cmc \n\t"          // overflow: flip the carry bit...
        "rcr %0 \n\t"       // ...and rotate it into the MSB, halving the difference
        "1: "
    : "+r"(a)
    : "r"(b)
    : "cc");
    return a;
}

I have tested the code with one million random inputs plus every combination of INT_MIN, -INT_MAX, INT_MIN/2, -1, 0, 1, INT_MAX/2, INT_MAX/2+1, INT_MAX. All tests passed. Can you prove me wrong?
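
For reference, a test harness along these lines (a sketch of the testing approach described above, not the original test code; it assumes compare_int from above is in scope and checks only the sign of the result against the branchless reference):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

static int compare_ref(int a, int b) { return (a > b) - (a < b); }

/* returns 1 if compare_int disagrees with the reference in sign */
static int check(int a, int b)
{
    int got = compare_int(a, b);
    int want = compare_ref(a, b);
    if ((got < 0) != (want < 0) || (got == 0) != (want == 0)) {
        printf("FAIL: compare_int(%d, %d) = %d\n", a, b, got);
        return 1;
    }
    return 0;
}

int main(void)
{
    const int edge[] = { INT_MIN, -INT_MAX, INT_MIN / 2, -1, 0, 1,
                         INT_MAX / 2, INT_MAX / 2 + 1, INT_MAX };
    const size_t n = sizeof edge / sizeof edge[0];
    int failures = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            failures += check(edge[i], edge[j]);
    for (int k = 0; k < 1000000; k++)
        failures += check(rand() - rand(), rand() - rand());
    printf("%d failures\n", failures);
    return failures != 0;
}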

fredoverflow
  • Even though it's 4 instructions, you may still find that a 5 or 6 instruction branchless sequence is faster. – Paul R Jun 12 '12 at 17:53
  • @mfontanini I expect my solution to be the slowest by far for random inputs. If the overflow rarely happens, on the other hand, I would expect it to perform very well (branch prediction). Anyway, I just found it challenging to find an even shorter solution :) – fredoverflow Jun 12 '12 at 18:20
  • You will see that the inverted carry indeed equals the correct sign upon signed overflow if you write out the truth table for a one-bit full subtractor. In the process, keep in mind that for the correct sign and signed overflow calculations `minuend bit=1` and `subtrahend bit=1` should be treated as `-1` arithmetically, and the sum isn't overflowed IFF it's `0` or `-1`. So, your conjecture is correct. – Alexey Frunze Jun 14 '12 at 09:26
10

For what it's worth, I put together an SSE2 implementation. vec_compare1 uses the same approach as compare2 but requires just three SSE2 arithmetic instructions:

#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>

#define COUNT 1024
#define LOOPS 500
#define COMPARE vec_compare1
#define USE_RAND 1

int arr[COUNT] __attribute__ ((aligned(16)));

typedef __m128i vSInt32;

vSInt32 vec_compare1 (vSInt32 va, vSInt32 vb)
{
    vSInt32 vcmp1 = _mm_cmpgt_epi32(va, vb);   // -1 in each lane where a > b
    vSInt32 vcmp2 = _mm_cmpgt_epi32(vb, va);   // -1 in each lane where a < b
    return _mm_sub_epi32(vcmp2, vcmp1);        // per lane: +1, 0, or -1
}

int main ()
{
    for (int i = 0; i < COUNT; i++) {
#if USE_RAND
        arr[i] = rand();
#else
        for (int b = 0; b < sizeof(arr[i]); b++) {
            *((unsigned char *)&arr[i] + b) = rand();
        }
#endif
    }

    vSInt32 vsum = _mm_set1_epi32(0);

    for (int l = 0; l < LOOPS; l++) {
        for (int i = 0; i < COUNT; i++) {
            for (int j = 0; j < COUNT; j+=4) {
                vSInt32 v1 = _mm_loadu_si128((const __m128i *)&arr[i]);
                vSInt32 v2 = _mm_load_si128((const __m128i *)&arr[j]);
                vSInt32 v = COMPARE(v1, v2);
                vsum = _mm_add_epi32(vsum, v);
            }
        }
    }

    printf("vsum = %vd\n", vsum);

    return 0;
}

Time for this is 0.137s.

Time for compare2 with the same CPU and compiler is 0.674s.

So the SSE2 implementation is around 4x faster, as might be expected (since it's 4-wide SIMD).

Paul R
3

This code has no branches and uses 5 instructions. It may outperform other branchless alternatives on recent Intel processors, where cmov* instructions are quite expensive. The disadvantage is the non-symmetric set of return values (INT_MIN+1, 0, 1).

int compare_int (int a, int b)
{
    int res;

    __asm__ __volatile__ (
        "xor %0, %0 \n\t"      // res = 0
        "cmpl %2, %1 \n\t"     // compare a with b (sets the flags from a - b)
        "setl %b0 \n\t"        // low byte of res = (a < b)
        "rorl $1, %0 \n\t"     // rotate that bit into the MSB; ror leaves ZF untouched
        "setnz %b0 \n\t"       // low byte = (a != b), using ZF still set by cmpl
    : "=q"(res)
    : "r"(a)
    , "r"(b)
    : "cc"
    );

    return res;
}

This variant does not need initialization, so it uses only 4 instructions:

int compare_int (int a, int b)
{
    __asm__ __volatile__ (
        "subl %1, %0 \n\t"   // a -= b (sets the flags)
        "setl %b0 \n\t"      // low byte = (a < b); the upper bits keep the difference
        "rorl $1, %0 \n\t"   // rotate the (a < b) bit into the sign bit; ZF untouched
        "setnz %b0 \n\t"     // low byte = (a != b), using ZF still set by subl
    : "+q"(a)
    : "r"(b)
    : "cc"
    );

    return a;
}
Evgeny Kluev
0

Maybe you can use the following idea (in pseudocode; I didn't write assembly code because I am not comfortable with its syntax):

  1. Subtract the numbers (result = a - b)
  2. If there was no overflow, you're done (the jo instruction and branch prediction should work very well here)
  3. If there was overflow, fall back to any robust method: return (a < b) ? -1 : (a > b)

Edit: for additional simplicity, if there was overflow, flip the sign of the result instead of performing step 3.
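
A minimal C sketch of steps 1-3 (my illustration, not part of the original answer; __builtin_sub_overflow is a GCC/Clang builtin that performs the subtraction and reports signed overflow without undefined behavior):

int compare_int_sketch(int a, int b)
{
    int diff;
    /* Steps 1 and 2: if the subtraction does not overflow,
       the difference already has the correct sign. */
    if (!__builtin_sub_overflow(a, b, &diff))
        return diff;
    /* Step 3: overflow is rare, so this path predicts well. */
    return (a < b) ? -1 : (a > b);
}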

anatolyg
-2

You could consider promoting the integers to 64-bit values.
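
A sketch of what this might look like (my illustration; as the comments below discuss, naively truncating the 64-bit difference back to int goes wrong, so this version reduces the difference to a sign with comparisons, which somewhat defeats the purpose):

int compare_int_wide(int a, int b)
{
    long long d = (long long)a - b;   /* exact: no overflow in 64 bits */
    return (d > 0) - (d < 0);         /* reduce to -1, 0, or 1 */
}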

Puppy
  • You *could*, but since you have to return an int you would still need to deal with any overflow. – Paul R Jun 12 '12 at 12:35
  • @PaulR: Just mask out the extra bits. – Puppy Jun 12 '12 at 12:39
  • No - it would need more than that - consider the case where `a - b` can't be represented in 32 bits - the resulting sign if you just mask the high bits will be incorrect. You would need to e.g. saturate at 64 bits before converting back to 32 bits. This would take a lot more than 5 instructions. – Paul R Jun 12 '12 at 12:43
  • @PaulR: x64 has arithmetic shift instructions you can use. – Puppy Jun 12 '12 at 12:59
  • @DeadMG: But those would round either -1 or +1 to zero, incorrectly. I'd think along the lines of `x = x | x>>32`, logical shift. That gets the sign bit where it should be, and ensures that only `0LL` ends up as zero. The one bug it has is that `0x80000000LL` is treated as a negative value. – MSalters Jun 12 '12 at 13:28
  • @PaulR When you said "saturate", I immediately had goosebumps on my neck, because MMX has saturated integer arithmetic! As it turns out though, only for 8 bit and 16 bit integers :( – fredoverflow Jun 12 '12 at 17:35
  • @FredOverflow: yes, even SSE2/3/4 and AVX2 still only have 8 and 16 bit add/subtract with saturation, unfortunately. – Paul R Jun 12 '12 at 17:50
  • @PaulR I wrote an MMX solution with 7 instructions (mov, mov, mov, cmp, cmp, sub, mov). If you process two integer pairs in parallel, you get 3.5 instructions per pair. Of course this is a bit unrealistic, but still, interesting :) – fredoverflow Jun 12 '12 at 18:02
  • @FredOverflow: yes, interesting, but probably not very useful for e.g. a comparison function for a sort algorithm. Note that with SSE you can probably use max/min instructions to do an efficient implementation for 4 ints in parallel. – Paul R Jun 12 '12 at 18:04
  • @PaulR How exactly would max/min help? Just subtract and then manually cap at INT_MIN and INT_MAX? – fredoverflow Jun 12 '12 at 18:17
  • @FredOverflow: I haven't thought it through, but you can use max/min and subtract to get an unsigned absolute difference which can't overflow - from there it seems it should be possible to get to the desired result. When I have some time I'll try and do an efficient SSE implementation. – Paul R Jun 12 '12 at 20:50
  • @FredOverflow: OK - forget the previous idea - you can do it with two compares and a subtract - I'll post the code separately as an answer. – Paul R Jun 12 '12 at 21:19
  • @PaulR Two compares and a subtract is exactly what I came up with yesterday, see comment #8 :) – fredoverflow Jun 13 '12 at 05:28