
As part of a program that I'm writing, I need to compare two values in the form a + sqrt(b) where a and b are unsigned integers. As this is part of a tight loop, I'd like this comparison to run as fast as possible. (If it matters, I'm running the code on x86-64 machines, and the unsigned integers are no larger than 10^6. Also, I know for a fact that a1<a2.)

As a stand-alone function, this is what I'm trying to optimize. My numbers are small enough integers that double (or even float) can exactly represent them, but rounding error in sqrt results must not change the outcome.

// known pre-condition: a1 < a2  in case that helps
bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    return a1+sqrt(b1) < a2+sqrt(b2);  // computed mathematically exactly
}

Test case: is_smaller(900000, 1000000, 900001, 998002) should return true, but as shown in the comments by @wim, computing it with sqrtf() would return false. So would truncating back to integer with (int)sqrt().

a1+sqrt(b1) = 901000 exactly, and a2+sqrt(b2) = 901000.00050050037512481206. The nearest float to that is exactly 901000, so in single precision the two sides compare equal rather than less.


As the sqrt() function is generally quite expensive even on modern x86-64 when fully inlined as a sqrtsd instruction, I'm trying to avoid calling sqrt() as far as possible.

Removing sqrt by squaring potentially also avoids any danger of rounding errors by making all computation exact.

If instead the function was something like this ...

bool is_smaller(unsigned a1, unsigned b1, unsigned x) {
    return a1+sqrt(b1) < x;
}

... then I could just do return x > a1 && static_cast<uint64_t>(x-a1)*(x-a1) > b1; (note that a x-a1>=0 check would be vacuously true for unsigned operands, so the guard has to be written as x > a1).

But now since there are two sqrt(...) terms, I cannot do the same algebraic manipulation.

I could square the values twice, by using this formula:

      a1 + sqrt(b1) = a2 + sqrt(b2)
<==>  a1 - a2 = sqrt(b2) - sqrt(b1)
<==>  (a1 - a2) * (a1 - a2) = b1 + b2 - 2 * sqrt(b1) * sqrt(b2)
<==>  (a1 - a2) * (a1 - a2) = b1 + b2 - 2 * sqrt(b1 * b2)
<==>  (a1 - a2) * (a1 - a2) - (b1 + b2) = - 2 * sqrt(b1 * b2)
<==>  ((b1 + b2) - (a1 - a2) * (a1 - a2)) / 2 = sqrt(b1 * b2)
<==>  ((b1 + b2) - (a1 - a2) * (a1 - a2)) * ((b1 + b2) - (a1 - a2) * (a1 - a2)) / 4 = b1 * b2

Unsigned division by 4 is cheap because it is just a bitshift, but since I square the numbers twice I will need to use 128-bit integers and I will need to introduce a few >=0 checks (because I'm comparing inequality instead of equality).
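With those >=0 checks spelled out first, the intermediates actually stay within 64 bits under the stated bound (a, b <= 10^6); here is a sketch of the double-squared comparison derived from the chain above (illustrative, unbenchmarked):

```cpp
#include <cstdint>

// Sketch: the double-squaring with explicit sign checks.
// Pre-conditions as stated in the question: a1 < a2, all inputs <= 10^6.
bool is_smaller_squared(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    uint64_t d = a2 - a1;                 // positive, since a1 < a2
    if (b1 <= b2) return true;            // sqrt(b1) - sqrt(b2) <= 0 < d
    // Here sqrt(b1) - sqrt(b2) > 0, so squaring once is valid:
    //   b1 + b2 - d^2 < 2*sqrt(b1*b2)
    int64_t t = (int64_t)b1 + b2 - (int64_t)(d * d);
    if (t < 0) return true;               // the right side is non-negative
    // Both sides non-negative; square again. Now t <= 2*10^6 and
    // b1*b2 <= 10^12, so everything fits comfortably in 64 bits.
    return (uint64_t)t * (uint64_t)t < 4ull * b1 * b2;
}
```

It agrees with the exact double-precision result on the test case above: is_smaller_squared(900000, 1000000, 900001, 998002) returns true.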

It feels like there might be a way to do this faster, by applying better algebra to the problem. Is there a way to do this faster?

Peter Cordes
Bernard
    Just an observation: if `a1+sqrt(b1) – 500 - Internal Server Error May 08 '19 at 09:06
  • http://assemblyrequired.crashworks.org/timing-square-root/ – Marek R May 08 '19 at 09:15
  • you can also observe the fact that max(sqrt(b)) = 1000 if b <= 10^6. so you only need to investigate further if abs(a1-a2) <= 1000. otherwise there is always inequality – StPiere May 08 '19 at 09:20
  • Clang actually does a pretty decent job of optimizing this with FP shuffles, to do both sqrt operations in parallel. (Use signed `int` so it can do SIMD packed int->FP conversion cheaply) https://godbolt.org/z/GvNe2B. (includes versions with float and double; clang vectorizes better with double, but `sqrtpd` is slower than `sqrtps`) But yeah, multiplying twice with scalar 128-bit math might be good, especially if we don't care about Intel before Broadwell (where `adc` was 2 uops). – Peter Cordes May 08 '19 at 09:23
    I'm afraid you are going too fast into the code: there's another question "https://stackoverflow.com/questions/52807071/how-to-check-that-a-point-is-inside-the-given-radius/52816599#52816599", where I've given a way to reduce floating point arithmetic, just by interpreting the use case. Can you explain us the use case, maybe we might come up with a better solution? (What's the meaning of "a1+sqrt(b1) – Dominique May 08 '19 at 09:25
  • another brute-force approach would be a lookup array: x2->x: [0-10^6] -> [0-1000]. you'd have to measure this ... more space needed (2-byte integer x 10^6 integers) -> 2 * 10^6 bytes .. – StPiere May 08 '19 at 09:29
  • @MarekR: That's assuming your number already started as a `float` or `double`, not integer. `cvtsi2ss` isn't free. And it's on a Core 2, which has much slower hardware sqrt than Skylake (6 to 29 cycles per instruction on Core2, with the fastest numbers only for trivial cases like 0). Skylake has fixed one per 3-cycle throughput for single-precision FP sqrt `sqrtps` (but still 12 cycle latency). https://agner.org/optimize/ But yes, there might be something we can gain from `sqrt(x) ~= x * rsqrtps(x)` if we exclude the `x==0` case. That might be enough precision to round to nearest integer. – Peter Cordes May 08 '19 at 09:29
  • @StPiere That would work for my simpler example (with 3 parameters only). But when I have square roots on both sides on the inequality, I don't think I can use a lookup array... (what if I'm trying to do `is_smaller(1, 7, 2, 3)` ?) – Bernard May 08 '19 at 09:33
    @StPiere: Using a LUT for sqrt is going to be horrible on modern x86 if inputs are fairly uniformly distributed. A 4MiB cache footprint is way bigger than L2 cache size (typically 256kiB), so you'll mostly be getting L3 hits at best, like 45 cycle latency on Skylake. But even on a really old Core 2, single-precision sqrt has worst-case 29 cycle latency. (With a couple more cycles to convert to FP in the first place). On Skylake, FP sqrt latency ~= L2 cache hit latency and is pipelined at throughput = latency/4. Not to mention the impact of cache pollution on other code. – Peter Cordes May 08 '19 at 09:33
  • @StPiere Since `a1 < a2`, the `abs()` can be removed too. – Acorn May 08 '19 at 09:56
    Since `a1 < a2`, you can already exclude directly all cases where `b1 < b2` – kvantour May 08 '19 at 10:04
  • @Peter Cordes: yes, you're probably right. But I would still measure it first. We know nothing about how uniform the data distribution is or about the problem itself. For LUT values 2 bytes suffice (<=1000), so I come to 2MB needed, but this doesn't change much. Measure first, then decide. It is possible (although with small probability) that for this problem the LUT outperforms the sqrt calculation if the data distribution is nonuniform. Looking at the problem from multiple sides is always good – StPiere May 08 '19 at 10:33
  • @StPiere: An early-out fast path using integer checks is what I'd investigate first. Only if it was branch-mispredicting too much would I even consider a giant LUT. (But yes, 2MB, good point that we can use a narrow type for LUT values.) Cache misses lead to occasional *big* stalls too big for OoO exec to hide. But depending how `sqrt` latency fits into the critical path, out-of-order exec may be able to mostly hide it. (But sqrt throughput is definitely a concern). – Peter Cordes May 08 '19 at 10:50
  • @500-InternalServerError you can actually avoid the `sqrt` here. Set `t=a2-a1`, which is positive. You know that you want to test `a1+sqrt(b1) < a2+sqrt(b2)` which is equivalent to testing `sqrt(b1) - sqrt(b2) < t`. So the quick way is to test `sqrt(b1) < t`, which is equivalent to testing `b1 < t*t`, but `t*t` might be too big, so you could just test with integer division if `(b1/t)/t == 0` – kvantour May 08 '19 at 10:52
  • Very important: How is the distribution of those values? Are they likely clustered or uniformly distributed? – Daniel Jour May 08 '19 at 15:53
  • @DanielJour They are reasonably uniformly distributed in range [0, 10^6]. – Bernard May 08 '19 at 16:15
    Note that the results with single precision: `return a1+sqrtf(b1) < a2+sqrtf(b2);`, differ from the double precision results (`return a1+sqrt(b1) < a2+sqrt(b2);`), for `a1 = 900000; b1 = 1000000; a2 = 900001; b2 = 998002;` which is true in double precision and false in in single precision. This suggests that you shouldn't use fast approximate `sqrt` methods. – wim May 08 '19 at 21:17
  • @wim: there should be enough precision; maybe we need to `roundps` to the nearest integer, with nearest or floor. Or convert back to integer with truncation. Maybe I made a mistake when I moved the OP's pseudocode-comment in the question into actual C, because I assumed computing in FP would be equivalent to integer-sqrt (i.e. floor), in case that's what they meant. – Peter Cordes May 08 '19 at 22:44
    @wim: the exact results there are `a1+sqrt(b1) = 90100` and `a2+sqrt(b2) = 901000.00050050037512481206`. The nearest `float` to that is `90100`, so they're equal not less. https://www.h-schmidt.net/FloatConverter/IEEE754.html. So IDK whether the OP wants `a1 + (int)sqrt(b1)` or `a1 + lrint(sqrt(b1))`, or exact (for which double-precision may do the job), or what. – Peter Cordes May 08 '19 at 22:54
  • @PeterCordes; (replaces an earlier comment which was wrongly formatted). With an integer `sqrt` we don't have, for example, `10 + sqrt(10) < 10 +sqrt(11)`, because both side are equal. That would be undesirable I think. Because, after some algebraic integer manipulations you may find that actually `10 + sqrt(10) < 10 +sqrt(11)`. Therefore a double precision `a1+sqrt(b1) < a2+sqrt(b2);` seems most reasonable to me. – wim May 08 '19 at 23:10
  • @PeterCordes I want an _exact_ comparison, so `is_smaller(900000, 1000000, 900001, 998002)` should return true. – Bernard May 09 '19 at 09:44
    I updated your question with that test case, that's extremely helpful and definitely rules out that way of using `float` like @wim proved. Any comment on expected distribution of inputs? If we assume *uniform*, then [\@EricTowers calculated](https://stackoverflow.com/questions/56037046/comparing-two-values-in-the-form-a-sqrtb-as-fast-as-possible?noredirect=1#comment98741615_56047494) that even the simple `a2-a1 < 1000` early-out check rejects 99.8% of all inputs, and thus might be better than geza's check that requires a multiply to reject 99.946%. But most numbers in software aren't uniform. – Peter Cordes May 09 '19 at 10:13
    @PeterCordes I'm not too sure about the expected distribution of inputs. I'll need to do some testing/profiling of different solutions. – Bernard May 10 '19 at 04:40

5 Answers


Here's a version without sqrt, though I'm not sure whether it is faster than a version which has only one sqrt (it may depend on the distribution of values).

Here's the math (how to remove both sqrts):

ad = a2-a1
bd = b2-b1

a1+sqrt(b1) < a2+sqrt(b2)              // subtract a1
   sqrt(b1) < ad+sqrt(b2)              // square it
        b1  < ad^2+2*ad*sqrt(b2)+b2    // arrange
   ad^2+bd  > -2*ad*sqrt(b2)

Here, the right side is never positive. If the left side is positive, then we have to return true.

If the left side is negative, then we can square the inequality:

ad^4+bd^2+2*bd*ad^2 < 4*ad^2*b2

The key thing to notice here is that if a2>=a1+1000, then is_smaller always returns true (because the maximum value of sqrt(b1) is 1000). If a2<=a1+1000, then ad is a small number, so ad^4 will always fit into 64 bit (there is no need for 128-bit arithmetic). Here's the code:

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    int ad = a2 - a1;
    if (ad>1000) {
        return true;
    }

    int bd = b2 - b1;
    if (ad*ad+bd>0) {
        return true;
    }

    int ad2 = ad*ad;

    return (long long int)ad2*ad2 + (long long int)bd*bd + 2ll*bd*ad2 < 4ll*ad2*b2;
}

EDIT: As Peter Cordes noticed, the first if is not necessary, as the second if handles it, so the code becomes smaller and faster:

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    int ad = a2 - a1;
    int bd = b2 - b1;
    if ((long long int)ad*ad+bd>0) {
        return true;
    }

    int ad2 = ad*ad;
    return (long long int)ad2*ad2 + (long long int)bd*bd + 2ll*bd*ad2 < 4ll*ad2*b2;
}
geza
    It might be better to omit the `ad>1000` branch; I think the `ad*ad+bd>0` branch covers everything. One branch that's true most of the time is better than 2 branches that are each taken some of the time, for branch prediction. Unless the `ad>1000` check catches most of the inputs, it's going to be worth doing 1 extra `sub` and `imul`. (And probably a `movzx` to 64-bit) – Peter Cordes May 08 '19 at 11:48
  • @PeterCordes: nice observation (as usual :) ), thanks! – geza May 08 '19 at 11:51
  • Oh, were you doing that so `ad*ad` is known to not overflow an `int32_t`? After inlining, an x86-64 compiler can optimize away zero-extension to 64-bit (because writing a 32-bit register does that implicitly), so we could promote the unsigned inputs to `uint64_t` and then subtract to get `int64_t`. (32-bit subtract would require a `movsxd` sign-extension of a possibly-negative result, so avoid that.) – Peter Cordes May 08 '19 at 11:51
  • @PeterCordes: almost :) I added it, so `ad^4` surely doesn't overflow 64-bit. But yes, as you say, doing the `ad*ad` multiplication as 64-bit easily handles the case. – geza May 08 '19 at 11:55
  • Your fast version could avoid some sign-extension instructions by zero-extending `a2-a1` to 64-bit with `int64_t ad = a2 - a1`. (`a2-a1` has type `unsigned`, so it zero-extends. And that's free on x86-64.) `bd` needs to zero-extend the inputs to the subtraction, because we don't know if it's positive or negative. Or it needs to sign-extend the 32-bit result, but that's worse after inlining into a caller where the unsigned inputs are already known to be zero-extended in registers. (That's always the case unless they came in as function args or return values and haven't been touched yet.) – Peter Cordes May 08 '19 at 12:06
  • But yes, nice, transforming the fallback to avoid ever needing `sqrt` is great, even better than just branching to sometimes avoid a `sqrt`. – Peter Cordes May 08 '19 at 12:08
  • I was wondering if we can do anything with the fact(?) that `sqrt( b2-b1 ) <= sqrt(b2) - sqrt(b1)` (for `b2 >= b1`, and they're both integers >= 1.0, or they're zero. sqrt makes non-zero numbers closer to 1.0. So we could say that sqrt makes numbers closer to 0). https://www.wolframalpha.com/input/?i=sqrt(x-y)+-+(sqrt(x)+-+sqrt(y)) shows (I think) that the inequality is true: it finds global minima of 0 for x=y or y=0. So IDK if we can do an even better pre-check in terms of that which excludes more values, or if that would just give us a 2nd pre-check for some definitely-false inputs. – Peter Cordes May 08 '19 at 13:02
  • Probably you made a mistake somewhere. In my tests, for `a1 = 1000000; b1 = 100000; a2 = 999100; b2 = 700000;` the results differ from the original `return a1+sqrt(b1) < a2+sqrt(b2);`. See [Godbolt link](https://godbolt.org/z/-QlAik). – wim May 08 '19 at 22:25
  • My example above is wrong. I overlooked that we know that `a1 – wim May 09 '19 at 06:33
    @wim: I've verified that my solution is OK. But, I've found that there is some problem with Brendan's second version (and maybe with the first as well) :) – geza May 09 '19 at 07:48
  • @Geza: You are right. I focused too much on `a1+sqrt(b1) < a2+sqrt(b2)`, without paying attention to `a1 < a2`. – wim May 09 '19 at 13:27

I'm tired and probably made a mistake; but I'm sure that if I did, someone will point it out.

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    int a_diff = (int)a1 - (int)a2;   // may be negative

    if (a_diff < 0) {
        if (b1 < b2) {
            return true;
        }
        double temp = a_diff + sqrt(b1);
        if (temp < 0) {
            return true;
        }
        return temp * temp < b2;
    } else {
        if (b1 >= b2) {
            return false;
        }
    }
//  return a_diff + sqrt(b1) < sqrt(b2);

    double temp = a_diff + sqrt(b1);
    return temp * temp < b2;
}

If you know a1 < a2 then it could become:

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    unsigned a_diff = a2 - a1;    // will be positive

    if (b1 > b2) {
        return false;
    }
    if (b1 >= (uint64_t)a_diff * a_diff) {   // 64-bit, to avoid overflow when a_diff > 65535
        return false;
    }
    double temp = a_diff + sqrt(b2);
    return b1 < temp * temp;
}
Brendan
  • We know `a1 < a2`, so no need to test for `a_diff < 0`. But maybe it is worth testing it against `> 1000` (reversed). – Acorn May 08 '19 at 10:16
    You need a signed difference for `int a_diff`, and if you want to square it to check the condition without doing an actual sqrt you want `int64_t`. – Peter Cordes May 08 '19 at 10:22
  • Heh - I did make a mistake (if `a_diff+sqrt(b1);` is negative then you can't square it without borking the sign) - fixed. Also added the "if you know a1 < a2". – Brendan May 08 '19 at 10:25
  • Just define `a_diff = a2 - a1` for the second case. The first check should be `b1 <= b2`. The first square root you can delay if you just check if `b1/a_diff < a_diff` – kvantour May 08 '19 at 10:27
  • I'm also tired. Half-finished attempt: https://godbolt.org/z/U758t6. I think we can use `sqrt( abs(b1-b2) ) <= sqrt(b1) - sqrt(b2)` (which is true for numbers >= 1.0, and for 0). (ping @kvantour). But maybe better to just check `b1 < b2` first / instead, and `abs(a1-a2) <= 1000` – Peter Cordes May 08 '19 at 10:32
  • @kvantour: You're right (and doing `a_diff = a2 - a1` cleaned it up a lot). Didn't quite follow the `b1/a_diff < a_diff` so I did it differently and probably got that wrong. Will check back in about 10 hours. – Brendan May 08 '19 at 10:48
  • @Brendan Have a look at [this comment](https://stackoverflow.com/questions/56037046/comparing-two-values-in-the-form-a-sqrtb-as-fast-as-possible#comment98720260_56037046). Also you could first do the test on `b` and then compute `a_diff` – kvantour May 08 '19 at 10:53
  • the first algorithm, the `if (b1 < b2)` should be `if (b1 <= b2)` – jameh May 09 '19 at 05:20

There is also Newton's method for calculating integer square roots, as described here. Another approach would be to not calculate the square root at all, but to search for floor(sqrt(n)) via binary search ... there are "only" 1000 perfect squares less than 10^6. This probably has bad performance, but it would be an interesting approach. I haven't measured any of these, but here are examples:

#include <iostream>
#include <array>
#include <algorithm>        // std::lower_bound
#include <cassert>
#include <cmath>            // std::sqrt


bool is_smaller_sqrt(unsigned a1, unsigned b1, unsigned a2, unsigned b2)
{
    return a1 + sqrt(b1) < a2 + sqrt(b2);
}

static std::array<int, 1001> squares;

template <typename C>
void squares_init(C& c)
{
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = i*i;
}

inline bool greater(const int& l, const int& r)
{
    return r < l;
}

inline bool is_smaller_bsearch(unsigned a1, unsigned b1, unsigned a2, unsigned b2)
{
    // return a1 + sqrt(b1) < a2 + sqrt(b2)

    // find floor(sqrt(b1)) - binary search within 1000 elems
    auto it_b1 = std::lower_bound(crbegin(squares), crend(squares), b1, greater).base();

    // find floor(sqrt(b2)) - binary search within 1000 elems
    auto it_b2 = std::lower_bound(crbegin(squares), crend(squares), b2, greater).base();

    return (a2 - a1) > (it_b1 - it_b2);
}

unsigned int sqrt32(unsigned long n)
{
    unsigned int c = 0x8000;
    unsigned int g = 0x8000;

    for (;;) {
        if (g*g > n) {
            g ^= c;
        }

        c >>= 1;

        if (c == 0) {
            return g;
        }

        g |= c;
    }
}

bool is_smaller_sqrt32(unsigned a1, unsigned b1, unsigned a2, unsigned b2)
{
    return a1 + sqrt32(b1) < a2 + sqrt32(b2);
}

int main()
{
    squares_init(squares);

    // now can use is_smaller
    assert(is_smaller_sqrt(1, 4, 3, 1) == is_smaller_sqrt32(1, 4, 3, 1));
    assert(is_smaller_sqrt(1, 2, 3, 3) == is_smaller_sqrt32(1, 2, 3, 3));
    assert(is_smaller_sqrt(1000, 4, 1001, 1) == is_smaller_sqrt32(1000, 4, 1001, 1));
    assert(is_smaller_sqrt(1, 300, 3, 200) == is_smaller_sqrt32(1, 300, 3, 200));
}
StPiere
  • Your integer `sqrt32` runs a fixed 16 iterations of that loop, I think. You could maybe start it from a smaller position based on a bit-scan to find the highest set bit in `n` and divide by 2. Or just from a lower fixed start point since the known max for `n` is 1 mil, not ~4 billion. So that's about 12 / 2 = 6 iterations we could save. But this will probably still be slower than converting to single-precision `float` for `sqrtss` and back. Maybe if you did two square roots in parallel in the integer loop, so the `c` updates and loop overhead would be amortized, and there'd be 2 dep chains – Peter Cordes May 08 '19 at 12:49
  • The binary search inverting the table is an interesting idea, but still probably terrible on modern x86-64 where hardware sqrt is not very slow, but branch mispredicts are very costly relative to designs with shorter / simpler pipelines. Maybe some of this answer will be useful to someone with the same problem but on a microcontroller. – Peter Cordes May 08 '19 at 12:52

I'm not sure if algebraic manipulations, in combination with integer arithmetic, necessarily lead to the fastest solution. You'll need many scalar multiplies in that case (which aren't very fast), and/or branch prediction may fail, which may degrade performance. Obviously you'll have to benchmark to see which solution is fastest in your particular case.

One method to make the sqrt a bit faster is to add the -fno-math-errno option to gcc or clang. In that case the compiler doesn't have to check for negative inputs (and call the library routine to set errno). With icc this is the default setting.

More performance improvement is possible by using the vectorized sqrt instruction sqrtpd instead of the scalar sqrtsd. Peter Cordes has shown that clang is able to auto-vectorize this code, so that it generates sqrtpd.

However, the success of auto-vectorization depends quite heavily on the right compiler settings and on the compiler that is used (clang, gcc, icc, etc.). With -march=nehalem, or older, clang doesn't vectorize.

More reliable vectorization results are possible with the following intrinsics code, see below. For portability we only assume SSE2 support, which is the x86-64 baseline.

/* gcc -m64 -O3 -fno-math-errno smaller.c                      */
/* Adding e.g. -march=nehalem or -march=skylake might further  */
/* improve the generated code                                  */
/* Note that SSE2 in guaranteed to exist with x86-64           */
#include<immintrin.h>
#include<math.h>
#include<stdio.h>
#include<stdint.h>

int is_smaller_v5(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    uint64_t a64    =  (((uint64_t)a2)<<32) | ((uint64_t)a1); /* Avoid too much port 5 pressure by combining 2 32 bit integers in one 64 bit integer */
    uint64_t b64    =  (((uint64_t)b2)<<32) | ((uint64_t)b1); 
    __m128i ax      = _mm_cvtsi64_si128(a64);         /* Move integer from gpr to xmm register                  */
    __m128i bx      = _mm_cvtsi64_si128(b64);         
    __m128d a       = _mm_cvtepi32_pd(ax);            /* Convert 2 integers to double                           */
    __m128d b       = _mm_cvtepi32_pd(bx);            /* We don't need _mm_cvtepu32_pd since a,b < 1e6          */
    __m128d sqrt_b  = _mm_sqrt_pd(b);                 /* Vectorized sqrt: compute 2 sqrt-s with 1 instruction   */
    __m128d sum     = _mm_add_pd(a, sqrt_b);
    __m128d sum_lo  = sum;                            /* a1 + sqrt(b1) in the lower 64 bits                     */
    __m128d sum_hi  =  _mm_unpackhi_pd(sum, sum);     /* a2 + sqrt(b2) in the lower 64 bits                     */
    return _mm_comilt_sd(sum_lo, sum_hi);
}


int is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    return a1+sqrt(b1) < a2+sqrt(b2);
}


int main(){
    unsigned a1; unsigned b1; unsigned a2; unsigned b2;
    a1 = 11; b1 = 10; a2 = 10; b2 = 10;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));
    a1 = 10; b1 = 11; a2 = 10; b2 = 10;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));
    a1 = 10; b1 = 10; a2 = 11; b2 = 10;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));
    a1 = 10; b1 = 10; a2 = 10; b2 = 11;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));

    return 0;
}


See this Godbolt link for the generated assembly.

In a simple throughput test on Intel Skylake, with compiler options gcc -m64 -O3 -fno-math-errno -march=nehalem, I found a throughput of is_smaller_v5() which was 2.6 times better than the original is_smaller(): 6.8 cpu cycles vs 18 cpu cycles, with loop overhead included. However, in a (too?) simple latency test, where the inputs a1, a2, b1, b2 depended on the result of the previous is_smaller(_v5), I didn't see any improvement. (39.7 cycles vs 39 cycles).

wim
  • clang already auto-vectorizes like that :P https://godbolt.org/z/GvNe2B see the double and signed int version. But only for `double`, not `float`. **For throughput you should *definitely* use `float`** with this strategy, because packed conversion is only 1 uop, and because `sqrtps` has better throughput. The OP's numbers are all 1million or less, so can be exactly represented by `float`, and so can their square roots. And BTW, looks like you forgot to set `-mtune=haswell`, so your gcc picked a store/reload strategy for `_mm_set_epi32` instead of ALU `movd`. – Peter Cordes May 08 '19 at 22:29
  • @PeterCordes: Single precision isn't accurate enough, see my comment [here](https://stackoverflow.com/questions/56037046/comparing-two-values-in-the-form-a-sqrtb-as-fast-as-possible/56048240?noredirect=1#comment98739750_56037046). We only know that the target is x86-64. Somehow clang doesn't vectorize in that case. Even with `-march=nehalem`. Clang produces better asm with 4 `movd`s indeed. – wim May 08 '19 at 22:39
    @PeterCordes: Note that in a throughput test the auto vectorized function may easily bottleneck on port 5. Clang [generates](https://godbolt.org/z/Chhkc3) 9 p5 micro-ops (Skylake), if I counted right. – wim May 10 '19 at 13:11
  • I didn't look that closely. Not surprised it wasn't optimal. :P Interesting, I hadn't realized there was an intrinsic for `(u)comisd`. Makes sense that there is, of course, but I'd never noticed it before. You should be able to save a `movaps` if you use `movhlps` into a "dead" variable instead of `unpckhpd` (if you compile without AVX). But that requires a bunch of casting because intrinsics make it inconvenient to help the compiler optimize its shuffles that way. – Peter Cordes May 10 '19 at 13:19
  • @PeterCordes Actually, I have never used a `(u)comisd` intrinsic before, but here it seem useful. – wim May 10 '19 at 13:30

Possibly not better than other answers, but uses a different idea (and a mass of pre-analysis).

// Compute approximate integer square root of input in the range [0,10^6].
// Uses a piecewise linear approximation to sqrt() with bounded error in each piece:
//   0 <= x <= 784 : x/28
//   784 < x <= 7056 : 21 + x/112
//   7056 < x <= 28224 : 56 + x/252
//   28224 < x <= 78400 : 105 + x/448
//   78400 < x <= 176400 : 168 + x/700
//   176400 < x <= 345744 : 245 + x/1008
//   345744 < x <= 614656 : 336 + x/1372
//   614656 < x <= 1000000 : (784000+x)/1784
// It is the case that sqrt(x) - 7.9992711366390365897... <= pseudosqrt(x) <= sqrt(x).
unsigned pseudosqrt(unsigned x) {
    return 
        x <= 78400 ? 
            x <= 7056 ?
                x <= 784 ? x/28 : 21 + x/112
              : x <= 28224 ? 56 + x/252 : 105 + x/448
          : x <= 345744 ?
                x <= 176400 ? 168 + x/700 : 245 + x/1008
              : x <= 614656 ? 336 + x/1372 : (x+784000)/1784 ;
}

// known pre-conditions: a1 < a2, 
//                  0 <= b1 <= 1000000
//                  0 <= b2 <= 1000000
bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
// Try three refinements:
// 1: a1 + sqrt(b1) <= a1 + 1000, 
//    so is a1 + 1000 < a2 ?  
//    Convert to a2 - a1 > 1000 .
// 2: a1 + sqrt(b1) <= a1 + pseudosqrt(b1) + 8 and
//    a2 + pseudosqrt(b2) <= a2 + sqrt(b2), 
//    so is  a1 + pseudosqrt(b1) + 8 < a2 + pseudosqrt(b2) ?
//    Convert to a2 - a1 > pseudosqrt(b1) - pseudosqrt(b2) + 8 .
// 3: Actually do the work.
//    Convert to a2 - a1 > sqrt(b1) - sqrt(b2)
// Use short circuit evaluation to stop when resolved.
    unsigned ad = a2 - a1;
    return (ad > 1000)
           || (ad > pseudosqrt(b1) - pseudosqrt(b2) + 8)
           || ((int) ad > (int)(sqrt(b1) - sqrt(b2)));
}

(I don't have a compiler handy, so this probably contains a typo or two.)

Eric Towers
  • It does compile with `#include <cmath>` (https://godbolt.org/z/VH4I3g), but I don't think it will do very well. @Geza's answer has a *much* faster integer early-out, and does less integer math than this (with less branching, too) to fully avoid `sqrt` by squaring while only requiring 64-bit integers. – Peter Cordes May 08 '19 at 22:33
  • @PeterCordes : I don't have the means to do a timing comparison. Geza's early out handles 99.946% of uniformly distributed inputs and follows two `long` subtracts and a `long long` multiply, add, and compare. My early out is an `unsigned` subtract and compare. I don't see the basis of "*much* faster integer early-out". My early out handles 99.8% of uniformly distributed inputs. The `pseudosqrt()` case raises the handled set of inputs to 99.9725% costing additional 7 `unsigned` compares and one each of `unsigned` add and subtract. (continued) – Eric Towers May 08 '19 at 23:09
  • @PeterCordes : Unless `long long` operations are as fast as `unsigned` operations now, your "*much* faster" claim would be surprising. This seems to not be the case. [https://stackoverflow.com/questions/48779619/why-do-arithmetic-operations-on-long-long-int-take-more-time-than-for-an-int] – Eric Towers May 08 '19 at 23:13
    compare-and-branch is expensive unless branch prediction works perfectly. *much* more expensive than a `long long` multiply (3c latency, 1 cycle throughput on modern x86-64, like AMD Zen or Intel since Nehalem). **Only division costs more for wider types on modern x86-64, other operations are not data-dependent or type-width dependent.** A few older x86-64 CPUs like Bulldozer-family or Silvermont have slower 64-bit multiply. https://agner.org/optimize/. (We're of course talking about scalar; auto-vectorizing with SIMD makes narrow types valuable because you can do more per vector) – Peter Cordes May 08 '19 at 23:14
  • @PeterCordes : Then Geza's early out is 4 arithmetic operations and a compare and mine is 1 arithmetic operation and a compare. As far as I can tell, 1 is still much less than 4. – Eric Towers May 08 '19 at 23:16
  • That's a fair point, and I hadn't looked at the math for how much of the problem space each one covers. If 99.8% happens in practice, that would probably justify the simpler early-out check. But we don't know if the OP's inputs will be uniformly distributed or not, though. Geza's first version *does* have the `ad > 1000` early out, but [I suggested *removing* it](https://stackoverflow.com/questions/56037046/comparing-two-values-in-the-form-a-sqrtb-as-fast-as-possible/56047494#comment98722153_56039869) for better branch prediction. – Peter Cordes May 08 '19 at 23:23
  • @PeterCordes : I've been using Mathematica to test correctness and determine the fractions of the input space handled by various collections of conditions. What I can't do is any meaningful timing. – Eric Towers May 08 '19 at 23:27
  • Superscalar out-of-order execution CPUs don't always bottleneck on instruction throughput; the extra instructions might essentially no cost if they can execute in the shadow of a latency dependency chain. (i.e. if the surrounding code doesn't max out instruction-per-cycle, there is room for burst of extra instructions to execute without slowing anything else down.) Counting instructions doesn't always predict performance. Fewer uops executed is almost always better, but smaller static code size is nice too. – Peter Cordes May 08 '19 at 23:27
  • @PeterCordes : Then my code may still be in good shape. I expect two `long long` constants in Geza's code and 23 `unsigned`s in mine (since I expect any vaguely optimizing compiler to inline `pseudosqrt()`). – Eric Towers May 08 '19 at 23:31
  • So anyway, it would be an interesting experiment to test Geza's final version with/without adding in an `ad > 1000` early out before (or more likely instead of) the current early out. But nobody can do that without knowing how the OP's actual inputs are distributed. Numbers in computer programs are rarely uniform, and this is the critical factor in which early out will filter best and how well branch prediction will do on a real CPU. (Past branch history is part of the input to predicting a branch, so if different numbers are a reasult of different branching earlier, it might predict...) – Peter Cordes May 08 '19 at 23:32
  • Huh, your comment about number of constants makes no sense. Geza's `2ll * ...` and `4ll * ...` don't need to be stored as a 64-bit `0x000000000002`. It's a shift by 1, and multiply by 4 is a shift by 2. And x86-64 can do a shift-and-add in one instruction with `lea rax, [rdi + 2*rcx]` for example, so the `*2` and `+` combine into one cheap instruction. Code-size is mostly instructions, although yours will need a lot of 32-bit immediate constants to compare against. The Godbolt link in my first comment shows that it's a lot of total code size, even if most of it rarely executes. – Peter Cordes May 08 '19 at 23:37
  • @PeterCordes: I'm not entirely convinced that Geza's math is entirely right however. Probably it is easy to repair. I didn't look into the details. – wim May 09 '19 at 00:03
    @wim: yeah, whatever the OP does, a fairly exhaustive unit test is in order! (1M ^4 is too expensive, though, so some pruning of the search space to look at some large values, and some values that make the two sides of the inequality nearly equal, is needed.) – Peter Cordes May 09 '19 at 00:06
  • @EricTowers: It would be interesting to edit your answer to include your early-out analysis based on a uniform distribution. I think that's a valuable piece of analysis (the results, and the idea / method of even doing that in the first place instead of just guessing how good different early-outs will be). But still with the huge caveat that most code doesn't deal with uniformly-distributed numbers; small numbers are very common most of the time. – Peter Cordes May 09 '19 at 00:45