Why is fp division op slower than reciprocal op plus multiply op

Question

According to Agner Fog's instruction tables, a single fp division is slower than a single reciprocal op plus a single multiply op. (This seems to be common among the x86 architectures measured.)

This is an excerpt from the table for the Piledriver architecture.

Instruction    Operands  Ops  Latency  Recip. throughput  Pipes  Domain
MULSS MULSD    x,x/m     1    5-6      0.5                P01    fma
MULPS MULPD    x,x/m     1    5-6      0.5                P01    fma
VMULPS VMULPD  y,y,y/m   2    5-6      1                  P01    fma
DIVSS DIVPS    x,x/m     1    9-24     5-10               P01    fp
VDIVPS         y,y,y/m   2    9-24     9-20               P01    fp
DIVSD DIVPD    x,x/m     1    9-27     5-10               P01    fp
VDIVPD         y,y,y/m   2    9-27     9-18               P01    fp
RCPSS/PS       x,x/m     1    5        1                  P01    fp

The 4th value is latency. So the multiply ops take 5-6, the division ops take 9-24, and the reciprocal op takes 5 cycles. Since 24 > 6 + 5, I'm wondering why the 2 separate ops are faster than 1 single op to get essentially the same result.

I suspect the answer to this question involves the measurement of error. Perhaps division is much more accurate than reciprocal plus multiply. If so, how do the errors compare? Is there a linear relationship? For example, since division is nearly twice as slow as reciprocal + multiply, is it also twice as accurate?

The error is documented. [`rcpss` is good to 11.5 binary places](http://www.felixcloutier.com/x86/RCPSS.html). On the other hand, `divss` is IEEE division, so it's good to 24 binary places. — Raymond Chen, Jul 13 '16 at 04:40
As the Intel docs say: "The RCPSS (compute reciprocal of scalar single-precision floating-point values) instruction computes the ***approximate*** reciprocal of the low single-precision floating-point value in the source operand and stores the result in the low doubleword of the destination operand." (emphasis mine) — Rudy Velthuis, Jul 13 '16 at 06:59
See this related question about square root and its reciprocal operation: http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x — Ross Ridge, Jul 13 '16 at 16:45

score 4 · Accepted Answer · edited May 23 '17 at 12:32

IIRC, the fast approximate reciprocal and reciprocal-sqrt instructions (`rcpss`, `rsqrtss`) are basically a table lookup (from an internal table), without the iterative refinement that makes accurate division / sqrt slow and hard to pipeline. This is why / how they are implemented with one-per-clock throughput.

Notice that divss throughput isn't much better than latency until very recent microarchitectures, and even Skylake's very impressive FP divide / sqrt unit isn't fully pipelined.
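The accuracy gap between the approximate reciprocal and a true division can be checked directly. The following is a minimal sketch, assuming an x86 target with SSE and the `<xmmintrin.h>` intrinsics; the helper names `rcp_approx`, `rcp_refined`, and `rel_err` are made up for illustration. It also shows one Newton-Raphson step, x1 = x0 * (2 - d * x0), which is a common way to do the refinement in software: if x0 has relative error e, x1 has relative error of about e squared, so the number of correct bits roughly doubles per step rather than growing linearly. The refined result is still not guaranteed to be correctly rounded the way `divss` is.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_rcp_ss, _mm_cvtss_f32 */

/* Approximate 1/d with the RCPSS instruction (hypothetical helper name). */
static float rcp_approx(float d) {
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(d)));
}

/* RCPSS followed by one Newton-Raphson step: x1 = x0 * (2 - d * x0). */
static float rcp_refined(float d) {
    float x0 = rcp_approx(d);
    return x0 * (2.0f - d * x0);
}

/* Relative error of an approximate quotient a/d, measured against a
   double-precision reference. Both a and d must be nonzero. */
static double rel_err(float approx, float a, float d) {
    assert(a != 0.0f && d != 0.0f);
    double exact = (double)a / (double)d;
    double e = ((double)approx - exact) / exact;
    return e < 0.0 ? -e : e;
}
```

Comparing `rel_err(a / d, a, d)`, `rel_err(a * rcp_approx(d), a, d)`, and `rel_err(a * rcp_refined(d), a, d)` for a few inputs should show the plain `rcpss` product landing within the documented bound of 1.5 * 2^-12 (plus one rounding from the multiply), and the refined product close to single-precision accuracy. The refinement costs extra multiplies and a subtract, so whether it beats `divss` depends on the microarchitecture and on whether the code is bound by latency or throughput.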

As for the rest of your question, the answers are the same as for rsqrt, so see this question: [Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?](http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x)

(Thanks Ross for digging up the link)
