Why this unnecessary MOVAPD copy in gcc 9.1, in a tiny function

Question

Consider the following code:

double x(double a,double b) {
    return a*(float)b;
}

It converts b from double to float, then back to double, and multiplies the result by a.

When I compile it with gcc 9.1 with -O3 on x86/64 I get:

x(double, double):
        movapd  xmm2, xmm0
        pxor    xmm0, xmm0
        cvtsd2ss        xmm1, xmm1
        cvtss2sd        xmm0, xmm1
        mulsd   xmm0, xmm2
        ret

With clang and older versions of gcc I get this:

x(double, double):
        cvtsd2ss        xmm1, xmm1
        cvtss2sd        xmm1, xmm1
        mulsd   xmm0, xmm1
        ret

Here xmm0 is not copied into xmm2, a copy that seems unnecessary to me.

With gcc 9.1 and -Os I get:

x(double, double):
        movapd  xmm2, xmm0
        cvtsd2ss        xmm1, xmm1
        cvtss2sd        xmm0, xmm1
        mulsd   xmm0, xmm2
        ret

So -Os just removes the instruction that sets xmm0 to zero, but not the movapd.

I believe all three versions are correct, so could there be a performance benefit from the gcc 9.1 -O3 version? And if yes, why? Does the pxor xmm0, xmm0 instruction have any benefit?

The issue is similar to "Assembly code redundancy in optimized C code", but I don't think it's the same, because older versions of gcc do not generate the unnecessary copy.

As a wild guess, not being particularly familiar with this stuff, I'd say that the `movapd` is effectively free thanks to register renaming, and those extra instructions might eliminate some false dependencies. — Thomas Jager, Jul 28 '20 at 17:07
@ThomasJager: Nothing is ever free. It still costs a front-end uop, and code size (L1i cache footprint). It has zero latency in the back-end, and doesn't need an execution unit, but that's all - [Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342). (Same for pxor-zeroing, which GCC only uses thanks to Intel's short-sighted bad design for one-source scalar instructions that don't zero-extend into the destination.) — Peter Cordes, Jul 28 '20 at 17:26
There's no false dependency in clang's version, it's reading and writing the same register so the `cvtss2sd` output false dependency is already on the same register as it has an input true dependency on. Clang's version is optimal, gcc's version is dumb and a clear missed optimization. This happens a lot more often in tiny functions when GCC's register allocator does a poor job with hard-register constraints imposed by the calling convention; apparently GCC is not usually dumb like this between parts of larger functions. — Peter Cordes, Jul 28 '20 at 17:26
@PeterCordes can you elaborate on the pxor-zeroing? What is the purpose of it? Does `xmm0` have to be zero in the parts not used by the return value? But then the `-Os` version would be incorrect. — Unlikus, Jul 28 '20 at 17:43
@Unlikus: No, parts of registers outside the return-value proper are allowed to contain garbage in x86 / x86-64 calling conventions. (Unlike some RISC calling conventions where integer regs at least have to be sign or zero-extended when passing/returning narrow values). Full details in [Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI?](https://stackoverflow.com/a/36760539). There is no purpose; the `pxor`-zeroing is only needed because of the `movapd` missed-optimization. — Peter Cordes, Jul 28 '20 at 17:47
If you just needed zero-extension of the result, `movq xmm0,xmm0` would be the most compact (although that does make the critical path latency longer because move-elimination doesn't work on that.) — Peter Cordes, Jul 28 '20 at 17:48

score 8 · Accepted Answer · answered Jul 28 '20 at 17:45

This is a GCC missed optimization. Unfortunately that's not rare for GCC in tiny functions, where its register allocator does a poor job with the hard-register constraints imposed by the calling convention. GCC is apparently not usually this dumb between parts of larger functions.

The pxor-zeroing is there to break the (false) output dependency of cvtss2sd, which exists because of Intel's short-sighted design for single-source scalar instructions to leave the upper part of the destination vector unmodified. They started this with SSE1 for PIII, where it gave a short-term gain because PIII handled XMM regs as two 64-bit halves, so only writing one half let instructions like sqrtss be single-uop.

But they unfortunately kept this pattern even for SSE2 (new with Pentium 4), and later declined to fix it with the AVX versions of the SSE instructions. So compilers are stuck choosing between the risk of creating a long loop-carried dependency chain through a false dependency, and the cost of pxor-zeroing. GCC conservatively always uses pxor at -O3, omitting it at -Os. (2-source operations like mulsd already depend on the destination as an input, so zeroing is unnecessary for them.)
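To make the merging behaviour concrete, here is a minimal sketch using the SSE2 intrinsic `_mm_cvtss_sd`, which corresponds to cvtss2sd. The helper names `widen_into` and `upper_half_survives` are made up for this illustration, and it assumes an x86 / x86-64 target with SSE2. It only shows the architectural semantics (the upper element of the destination is an input), not any timing effect.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; x86 / x86-64 only */

/* Hypothetical helper: convert a float to double the way cvtss2sd does.
   The low element of the result is the converted value; the upper
   element is copied unchanged from merge_into. */
static __m128d widen_into(__m128d merge_into, float f)
{
    return _mm_cvtss_sd(merge_into, _mm_set_ss(f));
}

/* Returns 1 if the upper element of the old destination survives
   the conversion, which is what makes the destination an input. */
int upper_half_survives(void)
{
    __m128d old = _mm_set_pd(123.0, 456.0);  /* high = 123.0, low = 456.0 */
    __m128d r = widen_into(old, 1.5f);

    double out[2];
    _mm_storeu_pd(out, r);                   /* out[0] = low, out[1] = high */

    assert(out[0] == 1.5);                   /* low element: converted value */
    return out[1] == 123.0;                  /* high element: merged from old */
}
```

Passing `_mm_setzero_pd()` as the first argument would be the intrinsic-level analogue of pxor-zeroing the destination first; what a compiler actually emits for it is up to the compiler.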

In this case, with its poor choice of register allocation, leaving out pxor-zeroing would mean that converting (float)b back to double couldn't start until a was ready. So if the critical path was a being ready (b ready early), omitting it would increase the latency from a->result by 5 cycles on Skylake (for the 2-uop cvtss2sd to run only after a was ready, because the output has to merge into the register that originally held a.) Otherwise it's just the mulsd that has to wait for a, with all the stuff involving b done ahead of time.

foo same,same is another way to work around an output dependency; that's what clang is doing. (And what GCC tries to do for popcnt, which unexpectedly has one on Sandybridge-family that's not architecturally required, unlike these stupid SSE ones.)

BTW, AVX 3-operand instructions do sometimes provide a way to work around the false dependencies, by using a "cold" register, or one that was xor-zeroed, as the register to merge into. That includes scalar int->FP conversion, although clang sometimes just uses movd plus a packed conversion for that.
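As a rough sketch of the "merge into a zeroed register" idea for scalar int->FP, the SSE intrinsic `_mm_cvtsi32_ss` takes the merge source as an explicit argument. The function names `int_to_float_zeroed` and `self_check` are hypothetical, and this assumes an x86 / x86-64 target. Whether a compiler keeps the zeroed register as the merge operand in the generated code is not guaranteed; this only expresses the intent at the source level.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86 / x86-64 only */

/* Hypothetical helper: int -> float conversion that merges into a freshly
   zeroed vector, so the result does not depend on whatever the
   destination register held before. */
float int_to_float_zeroed(int i)
{
    __m128 zero = _mm_setzero_ps();          /* xor-zeroing idiom */
    __m128 v = _mm_cvtsi32_ss(zero, i);      /* low element = (float)i, rest = 0 */
    return _mm_cvtss_f32(v);                 /* extract the low element */
}

/* Sanity check on small values that convert exactly. */
void self_check(void)
{
    assert(int_to_float_zeroed(7) == 7.0f);
    assert(int_to_float_zeroed(-3) == -3.0f);
}
```

Compiling this with and without `-mavx` and comparing the asm is one way to see which register the compiler picks as the merge source.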

Related: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? (I should have just linked that, I forgot I already wrote this up in that much detail on Stack Overflow recently.)

The movapd and pxor zeroing don't cost any latency on modern CPUs, but nothing is ever free. They still cost a front-end uop, and code size (L1i cache footprint). movapd has zero latency in the back-end, and doesn't need an execution unit, but that's all. See [Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342)
