
Does anyone know why GCC/Clang will not optimize function test1 in the code sample below to use just the RCPPS instruction when the fast-math option is enabled? Is there another compiler flag that would generate this code?

typedef float float4 __attribute__((vector_size(16)));

float4 test1(float4 v)
{
    return 1.0f / v;
}

You can see the compiled output here: https://goo.gl/jXsqat

einpoklum
Chris_F

2 Answers


Because the precision of RCPPS is much lower than that of a full float division.

An option to enable that optimization would not be appropriate as part of -ffast-math.

The x86 target options section of the GCC manual says there is in fact an option that (together with -ffast-math) does get GCC to use these instructions, with a Newton-Raphson iteration to refine the result (see Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision and Newton Raphson with SSE2 - can someone explain me these 3 lines; SIMD and scalar have basically the same performance per instruction, and the Newton-iteration math is the same):

  • -mrecip This option enables use of RCPSS and RSQRTSS instructions (and their vectorized variants RCPPS and RSQRTPS) with an additional Newton-Raphson step to increase precision instead of DIVSS and SQRTSS (and their vectorized variants) for single-precision floating-point arguments. These instructions are generated only when -funsafe-math-optimizations is enabled together with -finite-math-only and -fno-trapping-math. Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994).

Note that GCC implements 1.0f/sqrtf(x) in terms of RSQRTSS (or RSQRTPS) already with -ffast-math (or the above option combination), and doesn't need -mrecip.

Also note that GCC emits the above sequence with additional Newton-Raphson step for vectorized single-float division and vectorized sqrtf(x) already with -ffast-math (or the above option combination), and doesn't need -mrecip.

  • -mrecip=opt

This option controls which reciprocal estimate instructions may be used. opt is a comma-separated list of options, which may be preceded by a ‘!’ to invert the option:

‘all’
    Enable all estimate instructions.
‘default’
    Enable the default instructions, equivalent to -mrecip.
‘none’
    Disable all estimate instructions, equivalent to -mno-recip.
‘div’
    Enable the approximation for scalar division.
‘vec-div’
    Enable the approximation for vectorized division.
‘sqrt’
    Enable the approximation for scalar square root.
‘vec-sqrt’
    Enable the approximation for vectorized square root.

So, for example, -mrecip=all,!sqrt enables all of the reciprocal approximations, except for square root.
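As a concrete invocation (a sketch only; the source file name test1.c is my own, and the exact codegen depends on GCC version and target), the manual's example flag combination would look like:

```shell
# Enable all reciprocal approximations except square root.
# -mrecip only takes effect together with -funsafe-math-optimizations,
# -ffinite-math-only and -fno-trapping-math, all implied by -ffast-math.
gcc -O3 -ffast-math -mrecip=all,!sqrt -S test1.c -o test1.s
```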

Note that Intel's new Skylake design further improves FP division performance, to 8-11c latency, 1/3c throughput. (Or one per 5c throughput for 256b vectors, but same latency for vdivps). They widened the dividers, so AVX vdivps ymm is now the same latency as for 128b vectors.

(SnB to Haswell did 256b div and sqrt with about twice the latency / recip-throughput, so they clearly only had 128b-wide dividers.) Skylake also pipelines both operations more, so about 4 div operations can be in flight. sqrt is faster, too.

So in several years, once Skylake is widespread, it'll only be worth doing rcpps if you need to divide by the same thing multiple times. rcpps and a couple fma might possibly have slightly higher throughput but worse latency. Also, vdivps is only a single uop; so more execution resources will be available for things to happen at the same time as the division.

It remains to be seen what the initial implementation of AVX512 will be like. Presumably rcpps and a couple FMAs for Newton-Raphson iterations will be a win if FP division performance is a bottleneck. If uop throughput is a bottleneck and there's plenty of other work to do while the divisions are in flight, vdivps zmm is probably still good (unless the same divisor is used repeatedly, of course).

Floating point division vs floating point multiplication has more about FP throughput vs. latency.

Peter Cordes
  • +1 Nice find on the Skylake latency numbers. I've been waiting for Agner Fog to post them, but I guess this will have to do until then. – Mysticial Aug 14 '15 at 14:10
  • @Mysticial: Found in this RWT thread: http://www.realworldtech.com/forum/?threadid=151723&curpostid=151877. See also http://www.realworldtech.com/forum/?threadid=151723&curpostid=151724, http://www.realworldtech.com/forum/?threadid=151723&curpostid=152052 – Peter Cordes Aug 14 '15 at 22:18
  • 1
    I OCD'ed with that latency table today. And I'm coming to the conclusion that ports 0 and 1 are virtually identical for Skylake. Both have a 4-cycle FPU, 1-cycle vector int-add/shift, 5 cycle vector int-mul. Though I don't find any evidence that divider is on both or just port 0. We'll see next week how right/wrong I am. – Mysticial Aug 14 '15 at 22:26
  • Dunno if you've seen it already, but page 33: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf Seems like I was spot on with my prediction. Ports 0 and 1 are indeed identical with respect to the "important" vector stuff. The divider stays on port 0. And the scalar integer multiply stays on port 1. So I imagine most (if not all of) the vector unit port 0/1 are identical and Intel is doing a copy/paste of the design into top half of AVX512. – Mysticial Nov 10 '15 at 02:21
  • @Mysticial: Neat. I've been disappointed that we won't see desktop skylake with eDRAM any time soon if at all, as well as how far away AVX512 is. But anyway, we don't know if first-gen AVX512 CPUs will widen most/all the execution units. They might decode some/all 512b operations into two uops, the way P6 before Core2 handled 128b SSE ops as two 64bit uops. Division will almost certainly have less throughput for wider vectors (true even for Skylake), and prob. also higher latency (true for Broadwell, but not skylake). – Peter Cordes Nov 10 '15 at 05:45
  • I have some confidence that the first gen AVX512 will be full width as was the case with Sandy Bridge. Knights Corner already had full width 512-bit vectors and it's even more tight on silicon real-estate. The vector divider certainly won't be full width on AVX512. The full precision divider is probably a 1 or 2 wide execution unit that is separate from the vector unit like the x87 FPU. (There's a lot of evidence to suggest that x87 FPU with it's 64-bit precision is now a completely separate execution unit on Skylake.) – Mysticial Nov 10 '15 at 16:12
  • And if you want a fast vector divide I believe you're supposed to use low precision reciprocal instructions (which are probably full-width) and then iterate one or two Newton Iterations with FMA. – Mysticial Nov 10 '15 at 16:14
  • If I'm not mistaken, your quote from the manual says that using -mrecip (or just -ffast-math) will emit rcpps + a Newton iteration for vector math, which is exactly the code gen OP showed. Or am I misreading it? I take it there is no way to get gcc to emit only rcpps (without a Newton iteration)? – JSQuareD Nov 28 '20 at 11:46
  • 1
    @JSQuareD: Not when auto-vectorizing, AFAIK. You can of course manually vectorize with intrinsics like `_mm_rcp_ps`. It would be plausible to have an option to use very inaccurate raw `rcpps` results, especially for the AVX512 version `vrcp14ps` which has more bits correct, but I don't think one exists. (Fun fact: AVX-512ER on Xeon Phi has `vrcp28ps` / `vrsqrt28ps` and `pd` which *are* accurate enough to use directly for fast-math, especially for float. IDK if GCC will do so.) – Peter Cordes Nov 28 '20 at 12:32

I was experimenting with a floating point math-heavy hot path in one of my applications, and found something similar. I don't usually look at the instructions emitted by my compiler, so I was a bit surprised and dug into the mathematical details.

Here's the set of instructions generated by gcc, annotated by me with the carried computation:

test1(float __vector): ; xmm0               = a
    rcpps   xmm1, xmm0 ; xmm1 = 1 / xmm0    = 1/a
    mulps   xmm0, xmm1 ; xmm0 = xmm0 * xmm1 = a * 1/a
    mulps   xmm0, xmm1 ; xmm0 = xmm0 * xmm1 = a * (1/a)^2
    addps   xmm1, xmm1 ; xmm1 = xmm1 + xmm1 = 2 * (1/a)
    subps   xmm1, xmm0 ; xmm1 = xmm1 - xmm0 = 2 * (1/a) - a * (1/a)^2
    movaps  xmm0, xmm1 ; xmm0 = xmm1        = 2 * (1/a) - a * (1/a)^2
    ret

So what's going on here? Why waste an additional 4 instructions on calculating an expression that is mathematically equivalent to just the reciprocal?

Well, the rcpps instruction is only an approximate reciprocal. The other arithmetic instructions (mulps, addps, subps) are exact up to single precision. Let's write r(x) for the approximate reciprocal function. The final result then becomes y = 2*r(a) - a*r(a)^2. If we substitute r(x) = (1 + eps) * (1/x), with eps being the relative error, we get:

y = 2 * (1 + eps) * (1/a) - a * (1 + eps)^2 * (1/a)^2
  = (2 + 2*eps - (1 + eps)^2) * (1/a)
  = (2 + 2*eps - (1 + 2*eps + eps^2)) * (1/a)
  = (1 - eps^2) * (1/a)

The relative error of rcpps is less than 1.5 * 2^-12, so eps <= 1.5 * 2^-12, so:

eps^2 <= 2.25 * 2^-24
      <  1.5  * 2^-23

So by executing these extra instructions we went from 12 bits of precision to 23 bits of precision. Note that a single precision float has 24 bits of precision, so we almost get full precision here.

So is this just some magical sequence of instructions that happens to get us extra precision? Not quite. It's based on Newton's method (which I gather is referred to as Newton-Raphson by folks who work with assembly a lot).

Newton's method is a root-finding method. Given some function f(x) it finds approximate solutions to f(x) = 0, by starting with an approximate solution x_0 and iteratively improving upon it. The Newton iteration is given by:

x_n+1 = x_n - f(x_n) / f'(x_n)

In our case, we can reformulate finding the reciprocal 1/a of a as finding the root of the function f(x) = a*x - 1, with derivative f'(x) = a. Substituting that into the equation for the Newton iteration we get:

x_n+1 = x_n - (a*x_n - 1) / a

Two observations:

  1. In this case the Newton iteration actually gives us an exact result, rather than just a better approximation. This makes sense, because Newton's method works by making a linear approximation of f around x_n. In this case f is completely linear, so the approximation is perfect. However...

  2. Computing the Newton iteration requires us to divide by a, which is the exact computation we're trying to approximate. This creates a circular problem. We break the cycle by modifying the Newton iteration to use our approximate reciprocal x_n for the division by a:

    x_n+1 =  x_n - (a*x_n - 1) * x_n
          ~= x_n - (a*x_n - 1) / a
    

This iteration would work just fine, but it's not great from a vector math perspective: it requires subtracting 1. To do so with vector math requires preparing a vector register with a sequence of 1s. This requires an additional instruction and an additional register.

We can rewrite the iteration to avoid this:

x_n+1 = x_n - (a*x_n - 1) * x_n
      = x_n - (a*x_n^2 - x_n)
      = 2*x_n - a*x_n^2

Now substitute x_0 = r(a) and we recover our expression from above:

y = x_1 = 2*r(a) - a*r(a)^2
JSQuareD