
Well, the Intel intrinsics guide states that the "sqrtsd" instruction has a latency of 18 cycles.

I tested it with my own program and it matches if, for example, we take 0.15 as input. But when the input is 256 (or any power of 2), the latency is only 13 cycles. Why is that?

One theory I had is that 13 is the latency of "sqrtss", which is the same as "sqrtsd" but done on 32-bit floating point. So maybe the processor was smart enough to understand that 256 fits in 32 bits and used that version, while 0.15 needs the full 64 bits since it isn't representable in a finite way.

I am doing it using inline assembly; here is the relevant part, compiled with gcc -O3 and -fno-tree-vectorize.

static double sqrtsd (double x) {
    double r;
    __asm__ ("sqrtsd %1, %0" : "=x" (r) : "x" (x));
    return r;
}
Tommy95
  • Show us the code of the test. I can imagine an implementation where the optimization is made by the compiler and not by the processor. – Robert Navado Mar 12 '20 at 20:15
  • Processors are not smart: they perform the instruction given. – Weather Vane Mar 12 '20 at 20:24
  • Does this answer your question? [Why is SSE scalar sqrt(x) slower than rsqrt(x) \* x?](https://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x) – Tarek Dakhran Mar 12 '20 at 20:37
  • More related: [How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson?](https://stackoverflow.com/q/54642663) about how div/sqrt hardware works iteratively. – Peter Cordes Mar 12 '20 at 21:38
  • @TarekDakhran I am compiling with -O3 -fno-tree-vectorize so vectorization is disabled. – Tommy95 Mar 12 '20 at 22:10
  • You're not imagining things: [instlatx64](http://users.atw.hu/instlatx64/GenuineIntel/GenuineIntel0050654_SkylakeXeon9_InstLatX64.txt) for Skylake also lists 18 (worst case) and 13 (simple values) – harold Mar 12 '20 at 22:12
  • Your inline asm makes no sense and won't compile: https://godbolt.org/z/rJA6nS. `"i"` specifies an immediate and can't be an output constraint. `sqrtsd` only accepts a reg/mem input, not an immediate, so it wouldn't assemble even if it did compile. Also, using compile-time-constant immediates doesn't let you test latency, only throughput. But your numbers look sane, so whatever you actually did probably tested sqrtsd latency. – Peter Cordes Mar 12 '20 at 22:13
  • @EricPostpischil I will add the entire code once I get home but I know the code itself isn't the issue since I am using it to benchmark many instructions and it always works fine. And no, I am sure I am not measuring the throughput. The weird thing is only that when not using a power of two as input then latency matches the one stated by Intel, but otherwise it is slightly better... – Tommy95 Mar 12 '20 at 22:14
  • @harold Yes, that's exactly what I was trying to understand! – Tommy95 Mar 12 '20 at 22:25
  • @Tommy95: Could you please tell me how you benchmark assembly instructions at this granular level? I think that's not easy to do accurately, so I'm interested in your approach. Do you have your code on your GitHub or somewhere I can have a look? – Nawaz Aug 17 '20 at 16:26

1 Answer


SQRT* and DIV* are the only two "simple" ALU instructions (single uop, not microcoded branching / looping) that have data-dependent throughput or latency on modern Intel/AMD CPUs. (Not counting microcode assists for denormal aka subnormal FP values in add/multiply/fma). Everything else is pretty much fixed, so the out-of-order uop scheduling machinery doesn't need to wait for confirmation that a result was ready some cycle; it just knows it will be.

As usual, Intel's intrinsics guide gives an over-simplified picture of performance. The actual latency isn't a fixed 18 cycles for double-precision on Skylake. (Based on the numbers you chose to quote, I assume you have a Skylake.)

div/sqrt are hard to implement; even in hardware the best we can do is an iterative refinement process. Refining more bits at once (radix-1024 divider since Broadwell) speeds it up (see this Q&A about the hardware). But it's still slow enough that an early-out is used to speed up simple cases (Or maybe the speedup mechanism is just skipping a setup step for all-zero mantissas on modern CPUs with partially-pipelined div/sqrt units. Older CPUs had throughput=latency for FP div/sqrt; that execution unit is harder to pipeline.)
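(For intuition only, here's a software analogue of that refinement process. Real hardware uses table-driven digit-recurrence methods, and the magic-constant seed below merely stands in for the hardware's initial-estimate lookup; the point is just that the answer is reached by successive refinement, which an early-out can cut short for easy inputs.)

```c
#include <stdint.h>
#include <string.h>

// Crude initial guess for 1/sqrt(x): the classic bit-level trick, standing in
// for the lookup table a hardware div/sqrt unit starts from.
static double rsqrt_guess(double x) {
    uint64_t i;
    memcpy(&i, &x, sizeof i);
    i = 0x5fe6eb50c7b537a9ULL - (i >> 1);
    double y;
    memcpy(&y, &i, sizeof y);
    return y;
}

// sqrt(x) by refining y ~ 1/sqrt(x) with Newton-Raphson, then multiplying
// by x. Each step roughly doubles the number of correct bits.
static double nr_sqrt(double x, int steps) {
    double y = rsqrt_guess(x);
    for (int k = 0; k < steps; k++)
        y = y * (1.5 - 0.5 * x * y * y);   // Newton step for 1/sqrt(x)
    return x * y;
}
```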


https://www.uops.info/html-instr/VSQRTSD_XMM_XMM_XMM.html shows Skylake SQRTSD can vary from 13 to 19 cycle latency. The SKL (client) numbers only show 13 cycle latency, but we can see from the detailed SKL vsqrtsd page that they only tested with input = 0. SKX (server) numbers show 13-19 cycle latency. (This page has the detailed breakdown of the test code they used, including the binary bit-patterns for the tests.) Similar testing (with only 0 for client cores) was done on the non-VEX sqrtsd xmm, xmm page. :/

InstLatx64 results show best / worst case latencies of 13 to 18 cycles on Skylake-X (which uses the same core as Skylake-client, but with AVX512 enabled).

Agner Fog's instruction tables show 15-16 cycle latency on Skylake. (Agner does normally test with a range of different input values.) His tests are less automated and sometimes don't exactly match other results.

What makes some cases fast?

Note that most ISAs (including x86) use binary floating point:
the bits represent values as a linear significand (aka mantissa) times 2^exp, plus a sign bit.

It seems that there may be only 2 speeds on modern Intel (since at least Haswell; see discussion with @harold in comments). For example, even powers of 2 are all fast, like 0.25, 1, 4, and 16. These have a trivial mantissa of 0x0, representing 1.0. https://www.h-schmidt.net/FloatConverter/IEEE754.html has a nice interactive decimal <-> bit-pattern converter for single-precision, with checkboxes for the set bits and annotations of what the mantissa and exponent represent.

On Skylake the only fast cases I've found in a quick check are even powers of 2 like 4.0 but not 2.0. These numbers have an exact sqrt result with both input and output having a 1.0 mantissa (only the implicit 1 bit set). 9.0 is not fast, even though it's exactly representable and so is the 3.0 result. 3.0 has mantissa = 1.5 with just the most significant bit of the mantissa set in the binary representation. 9.0's mantissa is 1.125 (0b00100...). So the non-zero bits are very close to the top, but apparently that's enough to disqualify it.

(+-Inf and NaN are fast, too. So are ordinary negative numbers: result = -NaN. I measure 13 cycle latency for these on i7-6700k, same as for 4.0. vs. 18 cycle latency for the slow case.)

x = sqrt(x) is definitely fast with x = 1.0 (all-zero mantissa except for the implicit leading 1 bit). It has a simple input and simple output.

With 2.0 the input is also simple (all-zero mantissa and exponent 1 higher) but the output is not a round number. sqrt(2) is irrational and thus has infinite non-zero bits in any base. This apparently makes it slow on Skylake.

Agner Fog's instruction tables say that AMD K10's integer div instruction performance depends on the number of significant bits in the dividend (input), not the quotient, but searching Agner's microarch pdf and instruction tables didn't find any footnotes or info about how sqrt specifically is data-dependent.

On older CPUs with even slower FP sqrt, there might be more room for a range of speeds. I think number of significant bits in the mantissa of the input will probably be relevant. Fewer significant bits (more trailing zeros in the significand) makes it faster, if this is correct. But again, on Haswell/Skylake the only fast cases seem to be even powers of 2.


You can test this with something that couples the output back to the input without breaking the data dependency, e.g. andps xmm0, xmm1 / orps xmm0, xmm2 to set a fixed value in xmm0 that's dependent on the sqrtsd output.

Or a simpler way to test latency is to take "advantage" of the false output dependency of sqrtsd xmm0, xmm1 - it and sqrtss write only the low 64 / 32 bits (respectively) of the destination, leaving the rest unmodified, so the output register is also an input for that merging. I assume this is how your naive inline-asm attempt ended up bottlenecking on latency instead of throughput, with the compiler picking a different register for the output so it could just re-read the same input in a loop. The inline asm with "i" (immediate) constraints you originally added to your question is totally broken and won't even compile; the "x" (xmm register) input and output constraints the question now shows do work.

This NASM source for a static executable test loop (to run under perf stat) uses that false dependency with the non-VEX encoding of sqrtsd.

This ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves. Leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency). AVX finally lets us avoid this with vsqrtsd dst, src,src at least for register sources, and similarly vcvtsi2sd dst, cold_reg, eax for the similarly near-sightedly designed scalar int->fp conversion instructions. (GCC missed-optimization reports: 80586, 89071, 80571.)


On many earlier CPUs even throughput was variable, but Skylake beefed up the dividers enough that the scheduler always knows it can start a new div/sqrt uop 3 cycles after the last single-precision input.

Even Skylake double-precision throughput is variable, though: 4 to 6 cycles after the last double-precision input uop, if Agner Fog's instruction tables are right. https://uops.info/ shows a flat 6c reciprocal throughput. (Or twice that long for 256-bit vectors; 128-bit and scalar can use separate halves of the wide SIMD dividers for more throughput but the same latency.) See also Floating point division vs floating point multiplication for some throughput/latency numbers extracted from Agner Fog's instruction tables.

Peter Cordes
  • By the way, what about latencies between the two extremes? Do they happen? I couldn't make it happen on my Haswell, but that's not conclusive – harold Mar 12 '20 at 23:11
  • @harold: IDK, I'd guess that if it were possible, it would happen with some smaller number of trailing zeros in the mantissa. But maybe there's only one special-case early out detector for the most simple cases. Haswell's lower-radix divider should make it more profitable to look for an early out sooner, but maybe it's a matter of the initial estimate (from the same table rsqrt uses) being exact or not, and if not then it needs iterative refinement all the way to the end. – Peter Cordes Mar 12 '20 at 23:14
  • `rsqrt` is not exact for powers of two though (on Haswell anyway), but powers of two and zero are so far the only inputs I've found where the square root is fast, then again the `rsqrt` instructions seem to do more than just a lookup given how long their latency actually is – harold Mar 12 '20 at 23:18
  • @harold: `rsqrt` might not be the raw output of the LUT (yeah like you edited, the high latency might be some work). Or maybe it *leads to* the exact answer for simple inputs (all-zero mantissa). Or maybe all-zero mantissa can skip the LUT lookup *before* starting refinement. I don't know enough about HW dividers to rule out any of these guesses. :/ – Peter Cordes Mar 12 '20 at 23:22
  • 1
    Is `sqrtsd` fast for powers of two with odd exponents? Or only for powers of two with even exponents? This is interesting. – fuz Mar 13 '20 at 09:14
  • @fuz: ah you're right, I just tested using the `sqrtsd xmm2, xmm4` false output dependency method in a tight loop (1B iterations) and `4.0` is fast (13.013c / iter) while `2.0` is slow (18.016c/iter). So maybe it is only truly easy cases where the output is simple as well? – Peter Cordes Mar 13 '20 at 09:38
  • @PeterCordes Good question. I wonder if the same happens when you use a number like 9.0 with an integral square root. It could be that the instruction takes an early exit when all further output digits will be zero. – fuz Mar 13 '20 at 09:40
  • @fuz: 9.0 is slow: 18c/iter. NASM test loop and commands to build + run under perf: https://godbolt.org/z/C_JOAN (including the arith.divider_active perf counter which counts 7 cycles for fast, 12 cycles for slow). – Peter Cordes Mar 13 '20 at 09:46
  • @fuz: thanks again for prompting me to fix mis-information. Updated my answer with test results. – Peter Cordes Mar 13 '20 at 10:13