108

I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!

I'm testing it with a loop something like:

inline float TestSqrtFunction( float in );

void TestFunc()
{
  #define ARRAYSIZE 4096
  #define NUMITERS 16386
  float flIn[ ARRAYSIZE ]; // filled with random numbers ( 0 .. 2^22 )
  float flOut [ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache

  cyclecounter.Start();
  for ( int i = 0 ; i < NUMITERS ; ++i )
    for ( int j = 0 ; j < ARRAYSIZE ; ++j )
    {
       flOut[j] = TestSqrtFunction( flIn[j] );
       // unrolling this loop makes no difference -- I tested it.
    }
  cyclecounter.Stop();
  printf( "%d loops over %d floats took %.3f milliseconds",
          NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}

I've tried this with a few different bodies for TestSqrtFunction, and I've got some timings that really have me scratching my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float, using the x87 FPU, this was pathetically bad:

inline float TestSqrtFunction( float in )
{  return sqrt(in); }

The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:

inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
   _mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
   // compiles to movss, sqrtss, movss
}

This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).
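
(For reference, the body I tested was along the lines of the well-known bit trick — sketched here from memory, so treat the details as illustrative rather than my exact code:)

inline float CarmackRsqrt( float x )     // needs <string.h> for memcpy
{
   float xhalf = 0.5f * x;
   int i;
   memcpy( &i, &x, sizeof( i ) );        // reinterpret the float's bits as an int
   i = 0x5f3759df - ( i >> 1 );          // magic-constant initial guess
   memcpy( &x, &i, sizeof( x ) );
   return x * ( 1.5f - xhalf * x * x );  // one Newton-Raphson step
}
// then sqrt(x) ~= x * CarmackRsqrt(x)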

The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:

inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
   __m128 in = _mm_load_ss( pIn );
   _mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
   // compiles to movss, movaps, rsqrtss, mulss, movss
}

My question is basically this: what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?

I'm sure that this is really the cost of the op itself, because I've verified:

  • All data fits in cache, and accesses are sequential
  • The functions are inlined
  • Unrolling the loop makes no difference
  • Compiler flags are set to full optimization (and the assembly is good, I checked)

(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance for use in code that can't be vectorized.)

It'sNotALie.
Crashworks
  • You do appreciate, btw, that x * (1 / sqrt( x )) = x / sqrt(x) and NOT sqrt( x ) like you suggest ... – Goz Oct 29 '09 at 14:18
  • 16
    x / sqrt(x) = sqrt(x). Or, put another way: x^1 * x^(-1/2) = x^(1 - 1/2) = x^(1/2) = sqrt(x) – Crashworks Oct 29 '09 at 17:13
  • Is there a way to write `SSESqrt` so that it takes a `float` and returns a `float` instead of taking 2 pointers? – Gregory Pakosz Feb 16 '10 at 10:26
  • 6
    of course, `inline float SSESqrt( float restrict fIn ) { float fOut; _mm_store_ss( &fOut, _mm_sqrt_ss( _mm_load_ss( &fIn ) ) ); return fOut; }`. But this is a bad idea because it can easily induce a load-hit-store stall if the CPU writes the floats to the stack and then reads them back immediately -- juggling from the vector register to a float register for the return value in particular is bad news. Besides, the underlying machine opcodes that the SSE intrinsics represent take address operands anyway. – Crashworks Feb 16 '10 at 19:35
  • I have a `normalize(Vector3f& v)` function that does `v *= invSqrt(norm2(v));`. Does it mean I'm facing 2 LHS situations: first when `invSqrt` reads the result of `norm2`, second when `*=` reads the result of `invSqrt`? (and do I really need to worry about it on x86?) – Gregory Pakosz Feb 16 '10 at 20:29
  • 4
    How much LHS matters depends on the particular gen and stepping of a given x86: my experience is that on anything up to i7, moving data between register sets (eg FPU to SSE to `eax`) is very bad, while a round trip between xmm0 and stack and back isn't, because of Intel's store-forwarding. You can time it yourself to see for sure. Generally the easiest way to see potential LHS is to look at the emitted assembly and see where data is juggled between register sets; your compiler might do the smart thing, or it might not. As to normalizing vectors, I wrote up my results here: http://bit.ly/9W5zoU – Crashworks Feb 16 '10 at 23:44
  • Thank you for the follow up. I'm eager to learn more about LHS and optimization. Are there tools out there that could diagnose such problems in code? Profiling, static analysis, else? – Gregory Pakosz Feb 17 '10 at 09:52
  • 2
    For the PowerPC, yes: IBM has a CPU simulator that can predict LHS and many other pipeline bubbles through static analysis. Some PPCs also have a hardware counter for LHS that you can poll. It's harder for the x86; good profiling tools are scarcer (VTune is somewhat broken these days) and the reordered pipelines are less deterministic. You can try to measure it empirically by measuring instructions per cycle, which can be done precisely with the hardware performance counters. The "instructions retired" and "total cycles" registers can be read with eg PAPI or PerfSuite (http://bit.ly/an6cMt). – Crashworks Feb 17 '10 at 19:52
  • 2
    You can also simply write a few permutations on a function and time them to see if any suffer particularly from stalls. Intel doesn't publish many details about the way their pipelines work (that they LHS at all is sort of a dirty secret), so a lot of what I learned was by looking at a scenario that causes a stall on other archs (eg PPC), and then constructing a controlled experiment to see if the x86 has it as well. – Crashworks Feb 18 '10 at 07:42
  • 1
    @Crashworks What?! Don't `_mm_store_ss`/`_mm_load_ss`, use `return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(fIn)));`! There's no need for round-trip to memory. This is particularly useful in code where the compiler is using scalar SSE for math already. – doug65536 Jan 28 '14 at 14:00
  • @doug65536 The point of the code above is to provide a clear, isolated measurement of the difference between the two options. It is not production code. – Crashworks Jan 28 '14 at 19:37

6 Answers

218

sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal square root, accurate to about 11 bits.

sqrtss generates a far more accurate result, for cases when accuracy is required. rsqrtss exists for cases when an approximation suffices but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.
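
A sketch of that sequence with intrinsics, under the usual Newton-Raphson formulation (illustrative; check the manual for the exact recommended sequence):

#include <xmmintrin.h>

// rsqrtss estimate refined by one Newton-Raphson step:
// est' = est * (1.5 - 0.5 * x * est * est)
inline float rsqrt_nr( float x )
{
   __m128 v   = _mm_set_ss( x );
   __m128 est = _mm_rsqrt_ss( v );               // ~11-bit estimate of 1/sqrt(x)
   __m128 e2  = _mm_mul_ss( est, est );
   __m128 t   = _mm_sub_ss( _mm_set_ss( 1.5f ),
                            _mm_mul_ss( _mm_set_ss( 0.5f ), _mm_mul_ss( v, e2 ) ) );
   return _mm_cvtss_f32( _mm_mul_ss( est, t ) ); // ~22-bit 1/sqrt(x)
}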

edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
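
For instance, a minimal packed version of the questioner's rsqrt-times-x kernel might look like this (assuming 16-byte-aligned arrays and a count that is a multiple of four):

#include <xmmintrin.h>

void SqrtArray( float * pOut, const float * pIn, int count )
{
   for ( int i = 0; i < count; i += 4 )
   {
      __m128 in = _mm_load_ps( pIn + i );        // four floats per iteration
      _mm_store_ps( pOut + i, _mm_mul_ps( in, _mm_rsqrt_ps( in ) ) );
   }
}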

Stephen Canon
  • 3
    The n/r step gives you 22-bits of accuracy (it doubles it); 23-bits would be exactly full accuracy. – Jasper Bekkers Jul 12 '11 at 15:51
  • 7
    @Jasper Bekkers: No, it wouldn't. First, float has 24 bits of precision. Second, `sqrtss` is *correctly rounded*, which requires ~50 bits before rounding, and cannot be achieved using a simple N/R iteration in single precision. – Stephen Canon Jul 12 '11 at 15:54
  • 1
    This is definitely the reason. To extend this result: Intel's Embree project (http://software.intel.com/en-us/articles/embree-photo-realistic-ray-tracing-kernels/) uses vectorization for its mathematics. You can download the source at that link and look at how they do their 3/4-D vectors. Their vector normalization uses rsqrt followed by an iteration of Newton-Raphson, which is then very accurate and still faster than 1/sqrt! – Brandon Pelfrey Jun 24 '12 at 15:29
  • 7
    A small caveat: x*rsqrt(x) results in NaN if x is either zero or infinity. 0*rsqrt(0) = 0 * INF = NaN. INF*rsqrt(INF) = INF * 0 = NaN. For this reason, CUDA on NVIDIA GPUs computes approximate single-precision square roots as recip(rsqrt(x)), with the hardware providing both a fast approximation to the reciprocal and the reciprocal square root. Obviously, explicit checks handling the two special cases are also possible (but would be slower on the GPU). – njuffa Aug 04 '12 at 20:37
  • @BrandonPelfrey In which file did you find the Newton-Raphson step? – fredoverflow Dec 14 '13 at 15:58
8

This is also true for division. MULSS(a,RCPSS(b)) is way faster than DIVSS(a,b). In fact it's still faster even when you increase its precision with a Newton-Raphson iteration.

Intel and AMD both recommend this technique in their optimisation manuals. In applications which don't require IEEE-754 compliance, the only reason to use div/sqrt is code readability.
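
A sketch of that technique with intrinsics (one Newton-Raphson refinement of the reciprocal, est' = est * (2 - b*est); illustrative, not taken verbatim from the manuals):

#include <xmmintrin.h>

// a / b approximated as a * rcp(b), refined by one Newton-Raphson step
inline float DivRcpNR( float a, float b )
{
   __m128 vb  = _mm_set_ss( b );
   __m128 est = _mm_rcp_ss( vb );                // ~12-bit estimate of 1/b
   est = _mm_mul_ss( est, _mm_sub_ss( _mm_set_ss( 2.0f ),
                                      _mm_mul_ss( vb, est ) ) );
   return _mm_cvtss_f32( _mm_mul_ss( _mm_set_ss( a ), est ) );
}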

It'sNotALie.
Spat
  • 1
    Broadwell and later have better FP divide performance, so compilers like clang choose not to use reciprocal + Newton for scalar on recent CPUs, because it's usually *not* faster. In most loops, `div` isn't the only operation, so total uop throughput is often the bottleneck even when there's a `divps` or `divss`. See [Floating point division vs floating point multiplication](//stackoverflow.com/a/45899202), where my answer has a section on why `rcpps` is not a throughput win anymore. (Or a latency win), and numbers on divide throughput/latency. – Peter Cordes Apr 25 '18 at 11:38
  • If your accuracy requirements are so low that you can skip a Newton iteration, then yes `a * rcpss(b)` can be faster, but it's still more uops than `a/b`! – Peter Cordes Apr 25 '18 at 11:39
6

Instead of supplying an answer that might actually be incorrect (I'm also not going to check or argue about caches and other things; let's say they are identical), I'll try to point you to the source that can answer your question.
The difference might lie in how sqrt and rsqrt are computed. You can read more here: http://www.intel.com/products/processor/manuals/. I'd suggest starting by reading about the processor instructions you are using; there is some information there, especially about rsqrt (the CPU uses an internal lookup table with a rough approximation, which makes it much simpler to get the result). It may turn out that rsqrt is so much faster than sqrt that one additional mul operation (which isn't too costly) doesn't change the situation here.

Edit: A few facts that might be worth mentioning:
1. Once I was doing some micro-optimizations for my graphics library and I used rsqrt for computing the length of vectors (instead of sqrt, I multiplied my sum of squares by its rsqrt, which is exactly what you've done in your tests), and it performed better.
2. Computing rsqrt using a simple lookup table might be easier because, for rsqrt, as x goes to infinity 1/sqrt(x) goes to 0, so for large x's the function values don't change (much), whereas sqrt goes to infinity, so it isn't such a simple case ;).

Also, a clarification: I'm not sure where I found it in the books I linked, but I'm pretty sure I've read that rsqrt uses some lookup table, and that it should only be used when the result doesn't need to be exact, although I might be wrong as well, as it was some time ago :).

Marcin Deptuła
4

Newton-Raphson converges to the zero of f(x) using increments equal to -f/f', where f' is the derivative.

For x=sqrt(y), you can try to solve f(x) = 0 for x using f(x) = x^2 - y;

Then the increment is: dx = -f/f' = 1/2 (y/x - x) = 1/2 (y - x^2) / x, which has a slow divide in it.

You can try other functions (like f(x) = 1/y - 1/x^2) but they will be equally complicated.

Let's look at 1/sqrt(y) now. You can try f(x) = x^2 - 1/y, but it will be equally complicated: dx = (1 - y*x^2) / (2*x*y), for instance. One non-obvious alternate choice for f(x) is: f(x) = y - 1/x^2

Then: dx = -f/f' = -(y - 1/x^2) / (2/x^3) = 1/2 * x * (1 - y * x^2)

Ah! It's not a trivial expression, but you only have multiplies in it, no divide. => Faster!

And: the full update step new_x = x + dx then reads:

x *= 3/2 - y/2 * x * x which is easy too.
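
Put together as plain C, as a toy demonstration (the crude seed is for illustration; a real implementation would seed with rsqrtss or the bit trick):

#include <stdio.h>

float rsqrt_newton( float y )
{
   float x = 1.0f;                          // crude seed, fine for y near 1
   for ( int i = 0; i < 5; ++i )
      x *= 1.5f - 0.5f * y * x * x;         // the divide-free update derived above
   return x;
}

int main( void )
{
   printf( "%f\n", rsqrt_newton( 2.0f ) );  // prints ~0.707107 = 1/sqrt(2)
   return 0;
}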

Toribio
skal
2

There are a number of other answers to this already from a few years ago. Here's what the consensus got right:

  • The rsqrt* instructions compute an approximation to the reciprocal square root, good to about 11-12 bits.
  • It's implemented with a lookup table (i.e. a ROM) indexed by the mantissa. (In fact, it's a compressed lookup table, similar to mathematical tables of old, using adjustments to the low-order bits to save on transistors.)
  • The reason why it's available is that it is the initial estimate used by the FPU for the "real" square root algorithm.
  • There's also an approximate reciprocal instruction, rcp. Both of these instructions are a clue to how the FPU implements square root and division.

Here's what the consensus got wrong:

  • SSE-era FPUs do not use Newton-Raphson to compute square roots. It's a great method in software, but it would be a mistake to implement it that way in hardware.

The N-R algorithm to compute reciprocal square root has this update step, as others have noted:

x' = 0.5 * x * (3 - n*x*x);

That's a lot of data-dependent multiplications and one subtraction.

What follows is the algorithm that modern FPUs actually use.

Given b[0] = n, suppose we can find a series of numbers Y[i] such that b[n] = b[0] * Y[0]^2 * Y[1]^2 * ... * Y[n]^2 approaches 1. Then consider:

x[n] = b[0] * Y[0] * Y[1] * ... * Y[n]
y[n] = Y[0] * Y[1] * ... * Y[n]

Clearly x[n] approaches sqrt(n) and y[n] approaches 1/sqrt(n).

We can use the Newton-Raphson update step for reciprocal square root to get a good Y[i]:

b[i] = b[i-1] * Y[i-1]^2
Y[i] = 0.5 * (3 - b[i])

Then:

x[0] = n * Y[0]
x[i] = x[i-1] * Y[i]

and:

y[0] = Y[0]
y[i] = y[i-1] * Y[i]

The next key observation is that b[i] = x[i-1] * y[i-1]. So:

Y[i] = 0.5 * (3 - x[i-1] * y[i-1])
     = 1 + 0.5 * (1 - x[i-1] * y[i-1])

Then:

x[i] = x[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = x[i-1] + x[i-1] * 0.5 * (1 - x[i-1] * y[i-1])
y[i] = y[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
     = y[i-1] + y[i-1] * 0.5 * (1 - x[i-1] * y[i-1])

That is, given initial x and y, we can use the following update step:

r = 0.5 * (1 - x * y)
x' = x + x * r
y' = y + y * r

Or, even fancier, we can set h = 0.5 * y. This is the initialisation:

Y = approx_rsqrt(n)
x = Y * n
h = Y * 0.5

And this is the update step:

r = 0.5 - x * h
x' = x + x * r
h' = h + h * r

This is Goldschmidt's algorithm, and it has a huge advantage if you're implementing it in hardware: the "inner loop" is three multiply-adds and nothing else, and two of them are independent and can be pipelined.

In 1999, FPUs already needed a pipelined add/subtract circuit and a pipelined multiply circuit, otherwise SSE would not be very "streaming". Only one of each circuit was needed in 1999 to implement this inner loop in a fully-pipelined way without wasting a lot of hardware just on square root.

Today, of course, we have fused multiply-add exposed to the programmer. Again, the inner loop is three pipelined FMAs, which are (again) generally useful even if you're not computing square roots.
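
A software sketch of that inner loop, using C's fmaf to stand in for the hardware's fused multiply-adds (the seed parameter stands in for the rsqrt* estimate):

#include <math.h>

// Goldschmidt square root: seed approximates 1/sqrt(n) (e.g. from rsqrtss).
// Each step is the three multiply-adds described above.
float goldschmidt_sqrt( float n, float seed )
{
   float x = seed * n;                      // converges to sqrt(n)
   float h = seed * 0.5f;                   // converges to 0.5/sqrt(n)
   for ( int i = 0; i < 2; ++i )            // two steps refine an ~11-bit seed
   {
      float r = fmaf( -x, h, 0.5f );        // r  = 0.5 - x*h
      x = fmaf( x, r, x );                  // x' = x + x*r
      h = fmaf( h, r, h );                  // h' = h + h*r
   }
   return x;
}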

Pseudonym
  • 1
    Related: [How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson?](//stackoverflow.com/q/54642663) has some links to hardware div/sqrt execution unit designs. [Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision](//stackoverflow.com/q/31555260) - one Newton iteration in software, with or without FMA, for use with `_mm256_rsqrt_ps`, with Haswell perf analysis. Usually only a good idea if you don't have other work in the loop and would bottleneck hard on divider throughput. HW sqrt is single uop so is ok mixed with other work. – Peter Cordes Dec 05 '19 at 03:21
-3

It is faster because these instructions ignore rounding modes and do not handle floating-point exceptions or denormalized numbers. For these reasons it is much easier to pipeline, speculate, and execute other FP instructions out of order.

Witek
  • Obviously wrong. FMA depends on the current rounding mode, but has a throughput of two per clock on Haswell and later. With two fully-pipelined FMA units, Haswell can have up to 10 FMAs in flight at once. The right answer is `rsqrt`'s *much* lower accuracy, which means much less work to do (or none at all?) after a table-lookup to get a starting guess. – Peter Cordes Jul 06 '16 at 02:02