I was optimizing a function and measuring it with Google Benchmark when I noticed that my code unexpectedly slowed down in certain situations. I experimented with it, looked at the compiled assembly, and eventually came up with a minimal test case that exhibits the issue. Here is the assembly:
.text
test:
#xorps %xmm0, %xmm0
cvtsi2ss %edi, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
retq
.global test
This function follows the x86-64 System V calling convention used by GCC and Clang (integer argument in %edi, float result in %xmm0) for the function declaration extern "C" float test(int);
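For readers who don't read assembly, here is a rough C++ rendering of what the routine computes. This is only a sketch: the name test_reference is made up for illustration, and a compiler will not necessarily emit the same instruction sequence for it. The conversion followed by eight self-additions doubles the value eight times, so the result is the input times 256.

```cpp
#include <cassert>

// Hypothetical C++ equivalent of the assembly routine above.
// The name test_reference is an assumption, not part of the original code.
float test_reference(int x) {
    float f = static_cast<float>(x);  // cvtsi2ss %edi, %xmm0
    for (int i = 0; i < 8; ++i) {
        f += f;                       // addss %xmm0, %xmm0 (eight times)
    }
    return f;                         // float result is returned in %xmm0
}

// Small sanity check: eight doublings multiply the input by 2^8 = 256.
void check_reference() {
    assert(test_reference(1) == 256.0f);
}
```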
Note the commented-out xorps instruction. Uncommenting this instruction dramatically improves the performance of the function. On my machine with an i7-8700K, Google Benchmark shows that the function without the xorps instruction takes 8.54 ns (CPU time), while the function with the xorps instruction takes 1.48 ns. I've tested this on multiple computers with various OSes, processors, processor generations, and processor manufacturers (Intel and AMD), and they all exhibit a similar performance difference. Repeating the addss instruction makes the slowdown more pronounced (up to a point), and the slowdown still occurs with other instructions in its place (e.g. mulss), or even with a mix of instructions, so long as they all depend on the value in %xmm0 in some way. It's worth pointing out that the performance improves only when xorps is executed on every function call. Timing the function in a loop (as Google Benchmark does) with the xorps outside the loop still shows the slower performance.
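To make the measurement setup concrete, here is a minimal sketch of a timing loop, assuming plain std::chrono rather than Google Benchmark. The names ns_per_call and stand_in are hypothetical. The stand_in function exists only so the sketch compiles on its own; to reproduce the effect you would pass the assembly routine, declared as extern "C" float test(int), instead.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Stand-in so the sketch builds without the assembly file (hypothetical).
// It computes the same value but will not necessarily show the slowdown.
float stand_in(int x) {
    float f = static_cast<float>(x);
    for (int i = 0; i < 8; ++i) {
        f += f;
    }
    return f;
}

// Average wall-clock nanoseconds per call of fn over `iterations` calls.
// Assumes iterations > 0.
double ns_per_call(float (*fn)(int), std::int64_t iterations) {
    volatile float sink = 0.0f;  // volatile store keeps each call's result live
    auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < iterations; ++i) {
        sink = fn(static_cast<int>(i & 0xFF));
    }
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling ns_per_call(test, 100000000) once for each build of the routine (with and without the xorps) should give numbers comparable in spirit to the ones above, though absolute values will differ from Google Benchmark's because this loop includes its own overhead.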
Since this is a case where only adding an instruction improves performance, this appears to be caused by something really low-level in the CPU. Since it occurs across a wide variety of CPUs, it seems like this must be intentional. However, I couldn't find any documentation that explains why this happens. Does anybody have an explanation for what's going on here? The issue seems to depend on complicated factors, as the slowdown I saw in my original code only occurred at specific optimization levels (-O2, sometimes -O1, but not -Os), without inlining, and with a specific compiler (Clang, but not GCC).