
In an app I'm profiling, I found that in some scenarios this function can take over 10% of total execution time.

I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.

For reference, the compiler is MSVC++ 2008... though I'd assume the compiler choice isn't adding much overhead to sqrt.

See also here for similar discussion on modf function.

EDIT: for reference, this is one widely-used method, but is it actually much quicker? And how many cycles does sqrt take these days, anyway?

Mr. Boy
  • How are you using it? Built-in functions are likely to be pretty well optimal for the general case, but if you're using it in a more specialized way there's more scope for improvement. – David Thornley Apr 14 '10 at 13:36
  • Can you post some code? The best way of optimizing sqrt is to get rid of it, or at least reduce the number of calls to it, which may well be possible. – IVlad Apr 14 '10 at 13:36
  • Code is long and complex, soft-body physical modelling from a 3rd party. Not a couple of inner loops doing sqrt where length^2 could be used instead of length :) – Mr. Boy Apr 14 '10 at 13:41
  • Single or double precision? What accuracy do you need? – Paul R Apr 14 '10 at 13:46
  • @Paul... all using the float type, not double. I don't know right now if greater precision loss is acceptable. – Mr. Boy Apr 14 '10 at 15:52
  • Check this answer: http://stackoverflow.com/questions/686483/c-vs-c-big-performance-difference/687741#687741 – Hans Passant Apr 14 '10 at 16:19
  • Don't use the "fast inverse square root". If you're willing to settle for an approximation, the hardware `rsqrtss` (approximate reciprocal square root) is much faster. – Stephen Canon Apr 14 '10 at 18:22
  • See http://stackoverflow.com/questions/31555260/fast-vectorized-rsqrt-and-reciprocal-with-sse-avx-depending-on-precision for a decent version of (approximate) `rsqrtps` + a Newton iteration, giving +/-2ulp for single precision `float`. See http://stackoverflow.com/questions/32002277/why-does-gcc-or-clang-not-optimise-reciprocal-to-1-instruction-when-using-fast-m for the `-mrecip` compiler option, which is supposed to control automatic use of `rsqrt`, [but doesn't seem to actually do so (only `rcp` for 1/x)](http://stackoverflow.com/a/35737027/224132). – Peter Cordes Mar 02 '16 at 15:35

6 Answers


Yes, it is possible even without trickery:

  1. Sacrifice accuracy for speed: the sqrt algorithm is iterative, so re-implement it with fewer iterations.

  2. Lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.

  3. Caching: are you always taking the sqrt of the same limited set of values? If so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes of the same size, so results can be usefully cached.
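For illustration, a minimal sketch of point 1 (my own code, not from any particular library): seed with a crude exponent-halving bit trick, then run a fixed two Heron/Newton iterations. Fewer or more iterations trade accuracy for speed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Rough sqrt: halve the exponent via the bit pattern for a first guess,
// then refine with exactly two Heron (Newton) iterations.
// Positive finite inputs only.
inline float approx_sqrt(float x)
{
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);      // type-pun safely via memcpy
    i = (i >> 1) + (127u << 22);        // crude halving of the exponent
    float g;
    std::memcpy(&g, &i, sizeof g);      // initial guess
    g = 0.5f * (g + x / g);             // Heron iteration 1
    g = 0.5f * (g + x / g);             // Heron iteration 2
    return g;
}
```

Whether this beats the hardware instruction depends entirely on the CPU; on anything recent, `sqrtss` will likely win, so benchmark before committing.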


Hello from 11 years in the future.

Considering this still gets occasional votes, I thought I'd add a note about performance, which now, even more than then, is dramatically limited by memory access. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this: the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray. It is also very difficult to assign program time to individual instructions, and your profiling tool may mislead you here.

  1. On a related note, consider using SIMD/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.

  2. Take a look at section 15.12.3 of Intel's optimisation reference manual, which describes approximation methods with vectorised instructions, which would probably translate pretty well to other architectures too.
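A minimal sketch of the SIMD point, using the baseline SSE `_mm_sqrt_ps` (available on any x86-64 CPU) rather than the AVX-512-only `_mm512_sqrt_ps`:

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cmath>

// Four square roots with a single sqrtps instruction.
inline void sqrt4(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);         // unaligned load of 4 floats
    _mm_storeu_ps(out, _mm_sqrt_ps(v));  // 4 correctly-rounded sqrts at once
}
```

The win comes from amortising one instruction over four values, but only if your data is already laid out so you can batch the calls.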

James
  • I always find it hard to believe that manually doing even a small number of iterations could be faster than a built-in SQRT instruction... but then I guess SQRT isn't magic, it still does iterations inside. – Mr. Boy Apr 14 '10 at 13:46
  • Do you have any kind of metrics... how much improvement you have seen? – Mr. Boy Apr 14 '10 at 13:47
  • Mileage varies with usage :) You really have to profile your own usage scenario to see what works. Regarding the fsqrt instruction, you might be able to still use it, but speed things up a tiny bit by not checking for error conditions: depends on exactly what assembler your compiler is generating though. – James Apr 14 '10 at 14:08
  • Use the Quake sqrt algorithm. Its magic is not in the iteration but in the choice of an initial value. However gcc at O3 level implements it so my recommendation is to use that. – rytis Jan 18 '16 at 06:36
  • @rytis The famous quake algorithm is for the inverse square root (`1/sqrt(x)`), not for `sqrt(x)` – James Jan 18 '16 at 14:23
  • You're right of course although `1/(1/sqrt(x))` might work as a compromise. On a different note I wonder if you have in mind some kind of Gal's accurate tables' implementation in, say C++ when you mention LUTs? I'd be interested to know more – rytis Jan 18 '16 at 15:46

There's a great comparison table here: http://assemblyrequired.crashworks.org/timing-square-root/

Long story short: the SSE scalar `sqrtss` instruction is about 2x faster than the x87 `fsqrt`, and an approximation + iteration is about 4x faster than that (8x overall).

Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.
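To illustrate that pitfall: in C (and in some older C++ compilers), plain `sqrt` takes a `double`, so a `float` argument gets promoted and the result converted back; `sqrtf` (or C++'s `std::sqrt` float overload) avoids the round trip. A small sketch:

```cpp
#include <math.h>
#include <cassert>

// sqrt(double) forces float -> double -> sqrt -> float conversions;
// this spells out the path some compilers silently emit for sqrt(x).
inline float len_promoted(float x) { return (float)sqrt((double)x); }

// sqrtf stays single precision all the way through.
inline float len_single(float x)   { return sqrtf(x); }
```

On hardware with a single-precision sqrt instruction, the single-precision path is also typically lower latency, so it's worth checking the generated assembly.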

celion

You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)

Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.

Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this still will only bring your execution time down by 2.5%. For most applications users wouldn't even notice this, unless they're timing it with a stopwatch.

sbi
  • An example of this might be: if you want to find the nearest thing to you, you can compare the distances squared instead of taking the sqrt of each, since distance*distance > distance. Or you might be able to run a pre-process step which calculates the distance pairs of everything in advance. (Obviously I'm imagining some sort of 2-D or 3-D data structure). – stusmith Apr 14 '10 at 13:40
  • we're a bit beyond such trivialities in this case, I think actual values are genuinely needed rather than only being used in comparisons. – Mr. Boy Apr 14 '10 at 13:45
  • +1 For the realization that a big improvement to a piece of code used less frequently ends up with almost a negligible improvement in the *Big Picture*. – Thomas Matthews Apr 14 '10 at 17:20
  • Why do people seem to assume that the original poster is retarded? Just answer their question instead of telling them that they shouldn't try to do what they are doing. Maybe they have a good reason for doing what they are doing. 10% of code time is a big chunk of time for one function and deserves optimization if it is simple to do. I can't believe that this unhelpful response got upvoted so much. – Joe Jun 11 '12 at 22:13
  • @Joe: For one, I wasn't even sure you're talking to me until I read your last sentence. Where did I imply the OP is retarded? I tried to be as helpful as I could. If you don't like this answer, then that's what you can downvote for. OTOH, I see that John hasn't accepted one of the answers talking about improving the function itself either. (BTW, there are easily functions that take 10% of your app's execution time — or even more. `operator new()` for example.) – sbi Jun 12 '12 at 21:08
  • Actually I found this reply to be very well thought out, explained the possibilities well and weighed the pros and cons of improving an implementation vs choosing a different approach that may work better. +1 for a genuinely helpful and well put together answer. – leeor_net Jun 02 '13 at 11:49
  • "you're not asking the right question" should really be a comment rather than an answer, since it doesn't answer the question asked. It's correct but the question is quite specific. – Mr. Boy Oct 02 '14 at 10:28
  • @John: Your question was _"Is it possible to roll a significantly faster version of sqrt"_ – to which I answered _"I don't think so"_, adding an explanation for _why_ I think the way I do. If you think this doesn't answer your question, then maybe you haven't asked the question you thought you asked? – sbi Oct 02 '14 at 10:34
  • Small reductions in runtime for a large enough set of data can provide drastic improvements for the user. – Nicolas Holthaus Oct 02 '14 at 10:59
  • @Nicolaus: Yes, sometimes this happens, but I do not consider this common. IME a single function using 10% of your execution time is a rare thing to find, and the effect of dramatically decreasing such a function's CPU time comes to almost no gain for your application. – sbi Oct 02 '14 at 11:02

How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).
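If an approximation is acceptable, a hedged sketch of the hardware route (my code, not Quake3's): use the `rsqrtss` reciprocal-sqrt estimate plus one Newton-Raphson step, then multiply by x to recover sqrt(x). On most x86 CPUs this is faster than the "fast inverse square root" bit trick.

```cpp
#include <xmmintrin.h>  // SSE
#include <cassert>
#include <cmath>

// sqrt(x) from the hardware reciprocal-sqrt estimate (~12 bits accurate),
// refined by one Newton-Raphson step. Positive inputs only:
// rsqrt(0) is +inf, and 0 * inf would produce NaN.
inline float sqrt_via_rsqrt(float x)
{
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    r = r * (1.5f - 0.5f * x * r * r);   // Newton step for 1/sqrt(x)
    return x * r;                        // x * (1/sqrt(x)) == sqrt(x)
}
```

One Newton step takes the estimate to roughly single-precision accuracy; whether the result is "accurate enough" for soft-body physics is something only profiling and testing the actual app can answer.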

jemfinch
  • How fast is it though? If 1/sqrt is no good and you need sqrt, is the additional division still faster than the normal version? – Mr. Boy Apr 14 '10 at 13:54
  • @John: They have. That's why they created that function, after all. But that doesn't mean it would help (much) in your case. – sbi Apr 14 '10 at 15:45
  • @John: I can't test on your system, and system variation with the sort of floating point munging done in the referenced function is too significant a variable to ignore. – jemfinch Apr 14 '10 at 16:26
  • The so-called "fast inverse square root" is *not* "fast" on modern hardware. On nearly any processor designed in the last 10 years, there is a faster alternative. On many, the hardware square root instruction will be faster. Many have an even faster hardware inverse square root estimate (`rsqrtss` on SSE, `rsqrte` on ARMv7, etc). – Stephen Canon Apr 14 '10 at 18:20
  • @jemfinch... it's VS2008, running on typical PCs. Not just my PC. – Mr. Boy Apr 14 '10 at 19:13

Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is replace the sqrt function with an inline assembly version;

you can see a description of a load of alternatives here.

The best is this snippet of magic:

double inline __declspec (naked) __fastcall sqrt(double n)
{
    _asm fld qword ptr [esp+4]
    _asm fsqrt
    _asm ret 8
} 

It's about 4.7x faster than the standard sqrt call with the same precision.
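For comparison, a sketch of the same single-instruction idea without the inline-asm downsides raised in the comments (my code, not from the linked article): the SSE2 intrinsic has no naked function and no `esp+4` assumption, so the compiler can inline and schedule it normally.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>

// sqrtsd via intrinsic: no inline asm, no stack-layout assumptions,
// and the compiler is free to inline and vectorise around it.
inline double sqrt_intrin(double n)
{
    __m128d v = _mm_set_sd(n);
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));  // sqrt of the low element
}
```

With optimisation enabled, a decent compiler will emit the same `sqrtsd` for plain `sqrt()` anyway, which is worth verifying before hand-rolling anything.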

will
  • Could you please also suggest the implementation in GCC? How should this be modified for a float type? Thanks! – rytis Apr 30 '14 at 07:01
  • Then why isn't the compiler doing this? – Mr. Boy Oct 02 '14 at 10:29
  • I really don't know. – will Oct 02 '14 at 15:48
  • What the hell? That won't autovectorize, and will probably break when inlined (`esp+4` might not be `n`, and there's a `ret` inside the inline asm). It forces your input to go through a store/reload. That's another ~5 cycles of latency on top of 10-23 cycles for `fsqrt` on Haswell. `sqrtpd` (double) is 16c latency, while `sqrtps` (single) is 11c latency. Your compiler would have to really suck for this to be a win (maybe you compiled without optimization on?) Also, `ret 8`? That's a 3uop instruction. – Peter Cordes Mar 02 '16 at 01:40
  • I didn't write the article, so I'm not sure at the moment (not without going back and checking it) how he compiled and ran the tests. It surprises me too. – will Mar 03 '16 at 16:10
  • @will the tests are not accurate because he for instance compares `sqrt(int)` to `std::sqrt(double)`, which is wrong; internally std::sqrt's speed is not the same for all argument types. BTW, I reimplemented his test, and only 2 or 3 variants are somewhat faster, but accuracy is greatly sacrificed compared to the std variants. – metablaster Sep 27 '19 at 16:12

Here is a fast way with a lookup table of only 8KB. The error is ~0.5% of the result. You can easily enlarge the table, thus reducing the error. It runs about 5 times faster than the regular sqrt().

// LUT for fast sqrt of floats. The table consists of 2 halves: one for sqrt(X) and one for sqrt(2X).
#include <math.h>       /* sqrtf */
#include <stdint.h>

typedef uint32_t UINT32;
#define FALSE 0
#define TRUE  1

const int nBitsForSQRTprecision = 11;                       // Use only the 11 most significant bits of the float's 23-bit mantissa. Using 15 bits instead produces less error but takes more memory.
const int nUnusedBits   = 23 - nBitsForSQRTprecision;       // Number of mantissa bits we disregard
const int tableSize     = (1 << (nBitsForSQRTprecision+1)); // 2^nBits * 2 because we have 2 halves of the table.
static short sqrtTab[tableSize];
static unsigned char is_sqrttab_initialized = FALSE;        // Set to TRUE once initialized

// Precalculate sqrt() into the table for fast lookup later. Approximates the exact result with an error of about 0.5%.
// Note: to access the bits of a float quickly in C we pun through a pointer (this technically violates strict aliasing; memcpy is the portable alternative).
// More info: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void){
    unsigned short i;
    float f;
    UINT32 *fi = (UINT32*)&f;

    if (is_sqrttab_initialized)
        return;

    const int halfTableSize = (tableSize>>1);
    for (i=0; i < halfTableSize; i++){
         *fi = (i << nUnusedBits) | (127 << 23); // Build a float with bit pattern i as the mantissa and an exponent of 0 (stored as 127)

         // Take the square root, then store the top 'nBitsForSQRTprecision' bits of the mantissa in the table
         f = sqrtf(f);
         sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);

         // Repeat the process, this time with an exponent of 1 (stored as 128)
         *fi = (i << nUnusedBits) | (128 << 23);
         f = sqrtf(f);
         sqrtTab[i+halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
    }
    is_sqrttab_initialized = TRUE;
}

// Square root: halve the float's exponent and look up the sqrt of its mantissa in the precalculated table.
float fast_float_sqrt(float n){
    if (n <= 0.f)
        return 0.f;                           // Return 0 for 0 or negative input.
    UINT32 *num = (UINT32*)&n;
    short e;                                  // Exponent
    e = (*num >> 23) - 127;                   // A float stores its exponent with a bias of 127.
    *num &= 0x7fffff;                         // Keep only the mantissa.

    // If the exponent is odd we have to look the mantissa up in the second half of the table, so set the high bit.
    const int halfTableSize = (tableSize>>1);
    const int secondHalfTableIdBit = halfTableSize << nUnusedBits;
    if (e & 0x01)
        *num |= secondHalfTableIdBit;
    e >>= 1;                                  // Divide the exponent by two (in C the shift operators are sign-preserving for signed operands).

    // Do the table lookup on the truncated mantissa, then reconstruct the result into a float.
    *num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
    return n;
}
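A compact, self-contained variant of the same idea (my own restatement, not the answer's exact code): index a small table by the exponent's parity plus the top mantissa bits, then apply half the exponent with ldexp. With only 8 mantissa bits the table is 2KB and the error is around 0.2%:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// 512-entry LUT: tab[parity * 256 + top-8-mantissa-bits] = sqrt(mantissa),
// where odd exponents use the sqrt(2 * mantissa) half, as in the answer above.
static float tab[512];

inline void build_tab()
{
    for (int i = 0; i < 512; ++i) {
        float m = 1.0f + (i & 255) / 256.0f;  // mantissa in [1, 2)
        if (i & 256) m *= 2.0f;               // odd-exponent half covers [2, 4)
        tab[i] = std::sqrt(m);
    }
}

inline float lut_sqrt(float x)
{
    if (x <= 0.0f) return 0.0f;               // 0 for zero or negative input
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);            // aliasing-safe bit access
    int e = (int)(u >> 23) - 127;             // unbiased exponent
    int idx = ((e & 1) << 8) | ((u >> 15) & 255); // parity + top 8 mantissa bits
    int half = (e - (e & 1)) / 2;             // floor(e / 2), negative-safe
    return std::ldexp(tab[idx], half);        // sqrt(mantissa) * 2^(e/2)
}
```

The cache-miss caveat from the comment below applies just as much here: a 2KB table looks cheap in a microbenchmark but competes with your real working set.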
DanielHsH
  • I expect cache misses will make this worse than `sqrt` for use in real code, rather than code that tests the throughput of sqrt in a microbenchmark. x86 has very fast `sqrt` in hardware, and even faster `rsqrt` (approximate reciprocal sqrt), which you can refine with a newton-raphson step. – Peter Cordes Mar 02 '16 at 01:42