Optimizing Fixed-Point Sqrt

Question

I made what I think is a good fixed-point square root algorithm:

template<int64_t M, int64_t P>
typename enable_if<M + P == 32, FixedPoint<M, P>>::type sqrt(FixedPoint<M, P> f)
{
    if (f.num == 0)
        return 0;
    //Reduce it to the 1/2 to 2 range (based around FixedPoint<2, 30> to avoid left/right shift branching)
    int64_t num{ f.num }, faux_half{ 1 << 29 };
    ptrdiff_t mag{ 0 };
    while (num < (faux_half)) {
        num <<= 2;
        ++mag;
    }

    int64_t res = (M % 2 == 0 ? SQRT_32_EVEN_LOOKUP : SQRT_32_ODD_LOOKUP)[(num >> (30 - 4)) - (1LL << 3)];
    res >>= M / 2 + mag - 1; //Finish making an excellent guess
    for (int i = 0; i < 2; ++i)
        //                            \    |   /
        //                             \   |  /
        //                              _| V L
        res = (res + (int64_t(f.num) << P) / res) >> 1; //Use Newton's method to improve greatly on guess
        //                               7 A r
        //                              /  |  \
        //                             /   |   \
        //                       The Infamous Time Eater
    return FixedPoint<M, P>(res, true);
}

However, after profiling (in release mode) I found out that the division here takes up 83% of the time this algorithm spends. I can speed it up 6x by replacing the division with multiplication, but that's just wrong. I found out that integer division is much slower than multiplication, unfortunately. Is there any way to optimize this?

In case this table is necessary.

const array<int32_t, 24> SQRT_32_EVEN_LOOKUP = {
     0x2d413ccd, //magic numbers calculated by taking 0.5 + 0.5 * i / 8 from i = 0 to 23, multiplying by 2^30, and converting to hex
     0x30000000,
     0x3298b076,
     0x3510e528,
     0x376cf5d1,
     0x39b05689,
     0x3bddd423,
     0x3df7bd63,
     0x40000000,
     0x41f83d9b,
     0x43e1db33,
     0x45be0cd2,
     0x478dde6e,
     0x49523ae4,
     0x4b0bf165,
     0x4cbbb9d6,
     0x4e623850,
     0x50000000,
     0x5195957c,
     0x532370b9,
     0x54a9fea7,
     0x5629a293,
     0x57a2b749,
     0x59159016
};

SQRT_32_ODD_LOOKUP is just SQRT_32_EVEN_LOOKUP divided by sqrt(2).

You would get more attention to this question if you included a programming language tag like C++ — OneCricketeer, Mar 18 '16 at 06:22
try [sqrt by binary search without multiplication](http://stackoverflow.com/a/34657972/2521214) you need to port it to your fixed point (should be just a matter of changing LUT values) I think it should be faster but you need to measure it on your platform — Spektre, Mar 18 '16 at 06:36
[Fast fixed point pow, log, exp and sqrt](http://stackoverflow.com/q/4657468/995714), [Looking for an efficient integer square root algorithm for ARM Thumb2](http://stackoverflow.com/q/1100090/995714) — phuclv, Mar 18 '16 at 06:37
Have you compared the performance with [my algorithm](https://github.com/chmike/fpsqrt) using only adds,subs and shifts ? No multiplication and table lookup. — chmike, Mar 19 '21 at 14:46

score 1 · Accepted Answer · answered Mar 18 '16 at 09:24

1

Reinventing the wheel, really, and not in a good way. The correct solution is to calculate 1/sqrt(x) using NR, and then multiply once to get x/sqrt(x) - just check for x==0 up front.

The reason why this is so much better is that the NR step for y=1/sqrt(x) is just y = (3y-x*y*y)*y/2. That's all straightforward multiplication.

answered Mar 18 '16 at 09:24

MSalters

159,923
8
140
320

I think that should be `y = (3 - x*y*y)*y/2`. – user155698 Mar 18 '16 at 15:31
Thanks for the tip. I initially thought this wasn't going to work because the inverse square root would be on the other side of 1 (very bad for `FixedPoint<2, 30>` or `FixedPoint<32, 0>`). However, offsetting the inverse square root using `y = (3 - x*y*y/C^2)*y/2` to get `y` between 2^29 and 2^31 worked. It's a little less accurate than the old algorithm but it's 3 times faster. – user155698 Mar 18 '16 at 19:26
@user155698: You can use a 4th NR step and still be 2 times faster. – MSalters Mar 19 '16 at 11:27
That’s not a fixed point `sqrt`. – chmike Mar 19 '21 at 14:44
@chmike: If you do `(3y-x*y*y)*y/2` in fixed point, and `x*(1/sqrt(x))` in fixed point, then the result is `sqrt(x)` in fixed point. – MSalters Mar 19 '21 at 15:09
@MSalters I looked up how to compute 1/sqrt(x), but the quake trick (approximation) uses floating point. What does (3y-x\*y\*y)\*y/2 yield ? Is that 1/sqrt(x) ? I’m curious to see if it’s faster than [my algorithm](https://github.com/chmike/fpsqrt) using no multiplications and divisions. I’ll compare. – chmike Mar 19 '21 at 15:12
@chmike: The quake trick uses that float/integer trick just to get an initial estimation, and then uses Newton-Raphson to refine that estimation. `sqrt(x)` is well-behaved enough that `y=1.0` is a workable initial estimate (but depending on your fixed-point range, other choices may make more sense). Once you have an initial estimate, (3y-x*y*y)*y/2` yields a better estimate. – MSalters Mar 21 '21 at 18:33

Optimizing Fixed-Point Sqrt

1 Answers1