I've been trying to figure out a performance problem in an application and have finally narrowed it down to a really weird problem. The following piece of code runs 6 times slower on a Skylake CPU (i5-6500) if the VZEROUPPER instruction is commented out. I've tested Sandy Bridge and Ivy Bridge CPUs and both versions run at the same speed, with or without VZEROUPPER.

Now I have a fairly good idea of what VZEROUPPER does, and I think it should not matter at all to this code when there are no VEX-coded instructions and no calls to any function which might contain them. The fact that it does not matter on other AVX-capable CPUs appears to support this, as does Table 11-2 in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.

So what is going on?

The only theory I have left is that there's a bug in the CPU and it's incorrectly triggering the "save the upper half of the AVX registers" procedure where it shouldn't. Or something else just as strange.

This is main.cpp:

#include <immintrin.h>

int slow_function( double i_a, double i_b, double i_c );

int main()
{
    /* DAZ and FTZ, does not change anything here. */
    _mm_setcsr( _mm_getcsr() | 0x8040 );

    /* This instruction fixes performance. */
    __asm__ __volatile__ ( "vzeroupper" : : : );

    int r = 0;
    for( unsigned j = 0; j < 100000000; ++j )
    {
        r |= slow_function( 
                0.84445079384884236262,
                -6.1000481519580951328,
                5.0302160279288017364 );
    }
    return r;
}

and this is slow_function.cpp:

#include <immintrin.h>

int slow_function( double i_a, double i_b, double i_c )
{
    __m128d sign_bit = _mm_set_sd( -0.0 );
    __m128d q_a = _mm_set_sd( i_a );
    __m128d q_b = _mm_set_sd( i_b );
    __m128d q_c = _mm_set_sd( i_c );

    int vmask;
    const __m128d zero = _mm_setzero_pd();

    __m128d q_abc = _mm_add_sd( _mm_add_sd( q_a, q_b ), q_c );

    if( _mm_comigt_sd( q_c, zero ) && _mm_comigt_sd( q_abc, zero )  )
    {
        return 7;
    }

    __m128d discr = _mm_sub_sd(
        _mm_mul_sd( q_b, q_b ),
        _mm_mul_sd( _mm_mul_sd( q_a, q_c ), _mm_set_sd( 4.0 ) ) );

    __m128d sqrt_discr = _mm_sqrt_sd( discr, discr );
    __m128d q = sqrt_discr;
    __m128d v = _mm_div_pd(
        _mm_shuffle_pd( q, q_c, _MM_SHUFFLE2( 0, 0 ) ),
        _mm_shuffle_pd( q_a, q, _MM_SHUFFLE2( 0, 0 ) ) );
    vmask = _mm_movemask_pd(
        _mm_and_pd(
            _mm_cmplt_pd( zero, v ),
            _mm_cmple_pd( v, _mm_set1_pd( 1.0 ) ) ) );

    return vmask + 1;
}

The function compiles down to this with clang:

 0:   f3 0f 7e e2             movq   %xmm2,%xmm4
 4:   66 0f 57 db             xorpd  %xmm3,%xmm3
 8:   66 0f 2f e3             comisd %xmm3,%xmm4
 c:   76 17                   jbe    25 <_Z13slow_functionddd+0x25>
 e:   66 0f 28 e9             movapd %xmm1,%xmm5
12:   f2 0f 58 e8             addsd  %xmm0,%xmm5
16:   f2 0f 58 ea             addsd  %xmm2,%xmm5
1a:   66 0f 2f eb             comisd %xmm3,%xmm5
1e:   b8 07 00 00 00          mov    $0x7,%eax
23:   77 48                   ja     6d <_Z13slow_functionddd+0x6d>
25:   f2 0f 59 c9             mulsd  %xmm1,%xmm1
29:   66 0f 28 e8             movapd %xmm0,%xmm5
2d:   f2 0f 59 2d 00 00 00    mulsd  0x0(%rip),%xmm5        # 35 <_Z13slow_functionddd+0x35>
34:   00 
35:   f2 0f 59 ea             mulsd  %xmm2,%xmm5
39:   f2 0f 58 e9             addsd  %xmm1,%xmm5
3d:   f3 0f 7e cd             movq   %xmm5,%xmm1
41:   f2 0f 51 c9             sqrtsd %xmm1,%xmm1
45:   f3 0f 7e c9             movq   %xmm1,%xmm1
49:   66 0f 14 c1             unpcklpd %xmm1,%xmm0
4d:   66 0f 14 cc             unpcklpd %xmm4,%xmm1
51:   66 0f 5e c8             divpd  %xmm0,%xmm1
55:   66 0f c2 d9 01          cmpltpd %xmm1,%xmm3
5a:   66 0f c2 0d 00 00 00    cmplepd 0x0(%rip),%xmm1        # 63 <_Z13slow_functionddd+0x63>
61:   00 02 
63:   66 0f 54 cb             andpd  %xmm3,%xmm1
67:   66 0f 50 c1             movmskpd %xmm1,%eax
6b:   ff c0                   inc    %eax
6d:   c3                      retq   

The generated code is different with gcc, but it shows the same problem. An older version of the Intel compiler generates yet another variation of the function which shows the problem too, but only if main.cpp is not built with the Intel compiler, since it inserts calls to initialize some of its own libraries which probably end up doing VZEROUPPER somewhere.

And of course, if the whole thing is built with AVX support so the intrinsics are turned into VEX-coded instructions, there is no problem either.
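
For reference, the builds are nothing special - roughly like this (the command lines are illustrative, not copied verbatim from my setup):

# plain SSE build - shows the slowdown on Skylake
g++ -O3 -ffast-math -c slow_function.cpp
g++ -O3 -ffast-math main.cpp slow_function.o -o test

# building everything with AVX enabled turns the intrinsics into VEX-coded
# instructions and the problem goes away
g++ -O3 -ffast-math -mavx -c slow_function.cpp
g++ -O3 -ffast-math -mavx main.cpp slow_function.o -o test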

I've tried profiling the code with perf on Linux, and most of the runtime usually lands on 1-2 instructions, but not always the same ones depending on which version of the code I profile (gcc, clang, Intel). Shortening the function appears to make the performance difference gradually go away, so it looks like several instructions are causing the problem.
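
The profiling was just the usual perf workflow, roughly (exact events don't matter much, the picture is the same with plain cycles):

perf record ./test     # sample cycles across the run
perf report            # most samples land on one or two of the SSE instructions

perf stat ./test       # compare total cycles between the two builds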

EDIT: Here's a pure assembly version, for Linux. Comments below.

    .text
    .p2align    4, 0x90
    .globl _start
_start:

    #vmovaps %ymm0, %ymm1  # This makes SSE code crawl.
    #vzeroupper            # This makes it fast again.

    movl    $100000000, %ebp
    .p2align    4, 0x90
.LBB0_1:
    xorpd   %xmm0, %xmm0
    xorpd   %xmm1, %xmm1
    xorpd   %xmm2, %xmm2

    movq    %xmm2, %xmm4
    xorpd   %xmm3, %xmm3
    movapd  %xmm1, %xmm5
    addsd   %xmm0, %xmm5
    addsd   %xmm2, %xmm5
    mulsd   %xmm1, %xmm1
    movapd  %xmm0, %xmm5
    mulsd   %xmm2, %xmm5
    addsd   %xmm1, %xmm5
    movq    %xmm5, %xmm1
    sqrtsd  %xmm1, %xmm1
    movq    %xmm1, %xmm1
    unpcklpd    %xmm1, %xmm0
    unpcklpd    %xmm4, %xmm1

    decl    %ebp
    jne    .LBB0_1

    mov $0x1, %eax
    int $0x80
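
I assemble and link it with something like the following (illustrative; any equivalent invocation works since no libc is involved):

as test.s -o test.o
ld test.o -o test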

OK, so as suspected in the comments, using VEX-coded instructions causes the slowdown, and using VZEROUPPER clears it up. But that still does not explain why.

As I understand it, not using VZEROUPPER is supposed to involve a one-time cost when transitioning to legacy SSE instructions, but not a permanent slowdown of them. Especially not such a large one. Taking loop overhead into account, the ratio is at least 10x, perhaps more.

I have tried messing with the assembly a little and float instructions are just as bad as double ones. I could not pinpoint the problem to a single instruction either.

Olivier
  • What compiler flags are you using? Perhaps the (hidden) process initialization is using some VEX instructions which is putting you in a mixed state from which you never exit. You could try copy/pasting the assembly and building it as a pure assembly program with `_start`, so that you avoid any of the compiler-inserted init code and see if it exhibits the same issue. – BeeOnRope Dec 23 '16 at 20:35
  • @BeeOnRope I use `-O3 -ffast-math` but the effect is present even with `-O0`. I will try with pure assembly. You might be on to something as I just found out on [Agner's blog](http://agner.org/optimize/blog/read.php?i=415) that there have been some large internal changes to how VEX transitions are handled... will need to look into that. – Olivier Dec 23 '16 at 20:53
  • Yes - but the oddness is that on Skylake the penalties are supposed to be greatly reduced for running in the "bad" mixed modes - but I didn't re-read it yet to refresh my memory on the details. – BeeOnRope Dec 24 '16 at 19:07
  • @BeeOnRope that was my understanding as well but it's clearly not that simple. It seems to also have some kind of register specific tracking. In the asm version I added, if I introduce `vxorps` in the loop to clear ymm0-ymm5, it becomes faster but still not as fast as the VEX-free version. – Olivier Dec 27 '16 at 16:56
  • What did you find when you ran it as pure assembly? Also, have you checked the contents of ymm* on entry to your routine? Curious if the upper bits are zero or something else. – BeeOnRope Dec 27 '16 at 16:58
  • @BeeOnRope Everything is zero on entry. The only real finding is that somewhere in the code before `main()` there must have been a VEX instruction which I now have to insert myself to get the slowdown. This is scary as it implies that any piece of code in any library can permanently slow down my entire application long after I'm done calling it. At least until we move to AVX. – Olivier Dec 27 '16 at 17:18
  • It's really weird it causes a (huge) and permanent slowdown. The pre-Skylake slowdowns were all about VEX _transitions_, so if you had code that didn't have any transitions itself, at most you could get a small(ish) fixed slowdown once when you started your loop (since a transition could occur), but then you shouldn't have further slowdowns after that. At least as far as I understood. – BeeOnRope Dec 27 '16 at 17:21
  • I finally got off my ass and read the doc. The penalty is discussed pretty clearly in Intel's manual and while _different_ for Skylake, it is not necessarily better - and in your case it is much worse. I added the details in an answer. – BeeOnRope Dec 27 '16 at 17:54
  • I don't understand why you get an AVX instruction unless you compiled `main.cpp` with AVX and `slow_function.cpp` without, e.g. `g++ -c -O3 slow_function.cpp` and then `g++ -O3 -mavx slow_function.o main.cpp` – Z boson Dec 30 '16 at 09:12
  • @Zboson the AVX instruction is in the dynamic linker but I don't know why they put it there either. See my comment on BeeOnRope's answer. It's a fairly ugly problem. – Olivier Dec 30 '16 at 13:57
  • That's strange. Awesome that you figured that out. What do you mean, though, by a pure assembly version? You mean you wrote the assembly yourself? – Z boson Dec 30 '16 at 14:31
  • @Zboson - since the OP already had the assembly output (from objdump or whatever), the assumption was that he could just copy it to an assembly file and compile it (along with adding a _start symbol, etc, to make it run standalone). – BeeOnRope Dec 30 '16 at 17:36
  • @Zboson cut & pasted from the dump, reworked to avoid using any lib and trimmed down to keep only the essential part (removed function call, constants and branches which did not matter, etc). – Olivier Jan 02 '17 at 04:13
  • @BeeOnRope I think I got it now, that answers my next question. Presumably the OP used this assembly to test on other systems because the SNB and IVB systems might not have had the same `/lib64/ld-linux-x86-64.so.2`. – Z boson Jan 02 '17 at 09:22
  • Can you say exactly what SNB and IVB you tested on? More importantly, can you test on Haswell? It would be very interesting to see what you find on Haswell. – Z boson Jan 02 '17 at 09:36
  • @Zboson I think it is more that the dynamic linker method that's called when you compile as C/C++ isn't called when you compile as pure assembly. There's always a bunch of hidden runtime stuff that gets invoked when you are using C/C++ and I think this is included in that. – BeeOnRope Jan 02 '17 at 14:02
  • @Olivier, can you tell me how you figured out the problem was at `_dl_runtime_resolve_avx(), /lib64/ld-linux-x86-64.so.2`? The reason I ask is I think I [see this problem somewhere else](http://stackoverflow.com/q/42324992/2542702) but I want to narrow it down. – Z boson Feb 19 '17 at 20:54
  • @Zboson I think at some point my test case was slow with a `printf()` in `main()` before the test loop and fast without it. I traced it in gdb with stepi and quickly landed in that function, full of AVX code and no vzeroupper. A few searches later, I had found the glibc issue which clearly said there was a problem there. I have since found that `memset()` is equally problematic but don't know why (the code looks ok). – Olivier Feb 20 '17 at 19:58

2 Answers

You are experiencing a penalty for "mixing" non-VEX SSE and VEX-encoded instructions - even though your entire visible application doesn't obviously use any AVX instructions!

Prior to Skylake, this type of penalty was only a one-time transition penalty, when switching from code that used VEX to code that didn't, or vice versa. That is, you never paid an ongoing penalty for whatever happened in the past unless you were actively mixing VEX and non-VEX. In Skylake, however, there is a state where non-VEX SSE instructions pay a high ongoing execution penalty, even without further mixing.

Straight from the horse's mouth, here's Figure 11-1 [1] - the old (pre-Skylake) transition diagram:

[Figure 11-1: pre-Skylake transition penalties]

As you can see, all of the penalties (red arrows) bring you to a new state, at which point there is no longer a penalty for repeating that action. For example, if you get to the dirty upper state by executing some 256-bit AVX, and you then execute legacy SSE, you pay a one-time penalty to transition to the preserved non-INIT upper state, but you don't pay any penalties after that.

In Skylake, everything is different per Figure 11-2:

[Figure 11-2: Skylake transition penalties]

There are fewer penalties overall, but critically for your case, one of them is a self-loop: the penalty for executing a legacy SSE instruction while in the dirty upper state (Penalty A in Figure 11-2) keeps you in that state. That's what happens to you - any AVX instruction puts you in the dirty upper state, which slows all further SSE execution down.

Here's what Intel says (section 11.3) about the new penalty:

The Skylake microarchitecture implements a different state machine than prior generations to manage the YMM state transition associated with mixing SSE and AVX instructions. It no longer saves the entire upper YMM state when executing an SSE instruction when in “Modified and Unsaved” state, but saves the upper bits of individual register. As a result, mixing SSE and AVX instructions will experience a penalty associated with partial register dependency of the destination registers being used and additional blend operation on the upper bits of the destination registers.

So the penalty is apparently quite large - it has to blend the top bits all the time to preserve them, and it also makes instructions which are apparently independent become dependent, since there is a dependency on the hidden upper bits. For example, xorpd xmm0, xmm0 no longer breaks the dependence on the previous value of xmm0, since the result is actually dependent on the hidden upper bits from ymm0, which aren't cleared by the xorpd. That latter effect is probably what kills your performance, since you'll now have very long dependency chains that you wouldn't expect from the usual analysis.
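
To make that concrete, here's a sketch of the effect (my own illustration, not code from the question - the trigger instruction is the same one used in the assembly version above):

    vmovaps %ymm0, %ymm1        # any 256-bit VEX write puts us in the dirty upper state
    movl    $100000000, %ecx
.loop:
    xorpd   %xmm0, %xmm0        # would normally be dependency-breaking, but now it has
                                # to merge with the hidden ymm0[255:128], so it still
                                # waits on the previous value of ymm0
    sqrtsd  %xmm1, %xmm0        # chained behind that merge ...
    addsd   %xmm2, %xmm0        # ... and feeds the next iteration's merge
    decl    %ecx
    jne     .loop

Every iteration ends up serialized through the hidden upper half of ymm0, even though the visible xmm code looks independent.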

This is among the worst kinds of performance pitfall: the behavior/best practice for the prior architecture is essentially the opposite of what it is for the current architecture. Presumably the hardware architects had a good reason for making the change, but it does add another "gotcha" to the list of subtle performance issues.

I would file a bug against the compiler or runtime that inserted that AVX instruction and didn't follow up with a VZEROUPPER.

Update: per the OP's comment below, the offending (AVX) code was inserted by the runtime dynamic linker (ld-linux) and a bug report already exists.
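
Until that fix is deployed everywhere, the practical mitigation is the same one your main() already uses: issue a VZEROUPPER yourself after any stray AVX code and before the hot SSE sections. A minimal sketch (the helper name is just illustrative; note that with GCC/Clang the `_mm256_zeroupper()` intrinsic is only usable in a file compiled with AVX enabled, e.g. -mavx, which is why the raw inline asm from the question may be more convenient):

#include <immintrin.h>

/* Illustrative helper - only shows where the instruction goes. */
static void clear_dirty_upper_state()
{
#ifdef __AVX__
    _mm256_zeroupper();                         /* emits vzeroupper */
#else
    __asm__ __volatile__ ( "vzeroupper" : : : );
#endif
}

Calling it once after returning from code that may have executed 256-bit AVX (or at the top of main(), as in the question) keeps the legacy SSE code in the fast state.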


[1] From Intel's optimization manual.

BeeOnRope
  • Great! I got confused by first reading an older version of the manual without the Skylake comments and then the newer version not far enough. Doesn't help that the newer version has fewer pages than the old one. I will definitely track down the offending lib. – Olivier Dec 27 '16 at 18:46
  • The offending code is in `_dl_runtime_resolve_avx()`, `/lib64/ld-linux-x86-64.so.2`. Seems like this should sort itself out with the next release of glibc: https://sourceware.org/bugzilla/show_bug.cgi?id=20495 – Olivier Dec 27 '16 at 19:10
  • Interestingly enough, VZEROUPPER is not recommended on KNL, but the situation is being debated: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/704023 – Z boson Dec 28 '16 at 08:38
  • Why does the OP get an AVX instruction in `main.cpp` and not in `slow_function.cpp` unless he compiled `main.cpp` with AVX and `slow_function.cpp` without? GCC should not insert AVX instructions unless it is told to, because they would generate `SIGILL` on systems without AVX. – Z boson Dec 30 '16 at 09:20
  • @Zboson - I didn't see anywhere the OP was compiling the two files with different AVX flags? He said that he doesn't get the issue if he enables AVX compilation, which makes sense since the only penalties on Skylake are for legacy SSE execution (Penalty A). Furthermore, the instructions aren't inserted by the compiler (you won't find them by inspecting the binary), but instead occur at runtime due to some method which is called inside the runtime linker, as Olivier mentions above (I added the link also to the end of my answer). – BeeOnRope Dec 30 '16 at 17:39
  • @BeeOnRope So, if I will mix ymm and xmm register, I will encounter the penalty? Suppose I need to move 48 bytes of data. If I use the following sequence of instructions: "vmovdqu ymm0, ymmword ptr [rcx]; movups xmm1, xmmword ptr [rcx+32]; vmovdqu ymmword ptr [rdx], ymm0; movups xmmword ptr [rdx+32], xmm1", will I encounter the penalty? – Maxim Masiutin May 09 '17 at 19:12
  • @MaximMasiutin something is wrong with the formatting of your code, so I can't parse it - but, briefly, the problem isn't with mixing `ymm` and `xmm` registers, it is about mixing non-VEX and VEX encoded instructions. Anything with `ymm` is VEX-encoded, almost anything with 3 arguments is VEX-encoded, and I think all the vector instructions starting with `v` are VEX-encoded. So in your example if you change it to use `vmovdqu xmm, ...` you should be OK since that form is VEX-encoded. – BeeOnRope May 09 '17 at 19:48
  • @BeeOnRope As far as I understood from the answer, the transition from VEX to non-VEX is performed each time the upper part of the `ymm` registers is seen as non-zero, with an `xsave` to some memory. How does the CPU know which memory location to use for such state saving? – St.Antario Jan 10 '20 at 11:12
  • XSAVE takes a memory argument which is how it knows where to save. XSAVE doesn't play much into the penalties, although I didn't understand what you meant. XSAVE preserves the dirty/clean state of the upper bits. – BeeOnRope Jan 10 '20 at 13:29

I just made some experiments (on a Haswell). The transition between clean and dirty states is not expensive, but the dirty state makes every non-VEX vector operation dependent on the previous value of the destination register. In your case, for example, movapd %xmm1, %xmm5 will have a false dependency on ymm5, which prevents out-of-order execution. This explains why vzeroupper is needed after AVX code.
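
A minimal sketch of the point (illustrative only, not my actual test code):

    vmovaps %ymm0, %ymm1       # AVX code leaves the upper state dirty ...
    vzeroupper                 # ... so clear it before any legacy SSE

    movl    $100000000, %ebp
.loop:
    movapd  %xmm1, %xmm5       # with the clean state this is renamed as usual,
                               # no false dependency on the old value of ymm5
    sqrtsd  %xmm5, %xmm5
    decl    %ebp
    jne     .loop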

A Fog
  • You are one of the heroes of this site's [x86] tag. Avid followers of the tag quote you extensively here, since you're one of the rare sources on microarchitectural details of x86 processors. Keep up your good work! – Iwillnotexist Idonotexist Dec 28 '16 at 09:59
  • Cool, but the new behavior described above (second diagram with hidden register dependencies) is apparently only for Skylake and newer? On Haswell it is supposed to save the upper halves away somewhere so that subsequent non-VEX operations are fast. – BeeOnRope Dec 28 '16 at 14:14
  • I don't have access to test a Skylake at the moment. – A Fog Dec 29 '16 at 14:33
  • @BeeOnRope, the OP said he did not have the problem on Sandy Bridge and Ivy Bridge, only on Skylake. The OP did not test Haswell. But Agner sees a problem on Haswell. So I am a bit confused, because I would expect Haswell to act like Sandy Bridge and Ivy Bridge in this case. – Z boson Dec 30 '16 at 09:17
  • Yes, I'm confused too. – BeeOnRope Dec 30 '16 at 14:21
  • Is it possible that Haswell actually behaves like Skylake, but nobody described the behaviour until SKL came out? Or that it *sometimes* behaves this way? Any chance it's only a factor during the warm-up period before the upper halves of the 256b execution units power up? Maybe the state-transition behaviour is different during the period where AVX-256 instructions are slow? I just got a SKL desktop, and I have access to a Haswell laptop, so I may find some time to test this. Unfortunately I can't compare with IvB or SnB, which I assume do work the way you and Intel describe it. – Peter Cordes Dec 31 '16 at 16:41
  • Peter, Haswell has a cost of 70 clock cycles for every state transition when VEX and non-VEX code are mixed, just like Sandy and Ivy Bridge. Skylake does not have any delay on state transitions, but I think it has the same false dependence as I described for Haswell. – A Fog Jan 01 '17 at 19:32
  • @PeterCordes, it's possible Intel's documentation is wrong. It would not be the first time. This should be easy to test. Run the OP's assembly on a SNB or IVB system and HSW system and compare. Maybe the OP has access to a HSW system. – Z boson Jan 02 '17 at 09:35
  • @BeeOnRope - I have made a question http://stackoverflow.com/questions/43879935/avoiding-avx-sse-vex-transition-penalties -- thanks. – Maxim Masiutin May 09 '17 at 21:11
  • Just as a fun fact (going to bed now, just digging, ping me if anyone cares) - it seems Skylake with/without the microcode patch to disable the loop stream decoder makes a difference (SOMEHOW) too - you have no idea how painful working out the cause has been, but I can now get a result reliably so... it is that. – Alec Teal Feb 06 '19 at 22:23