
Fully knowing that these completely artificial benchmarks don't mean much, I am nonetheless a bit surprised by the several ways the "big 4" compilers chose to compile a trivial snippet.

struct In {
    bool in1;
    bool in2;
};

void foo(In &in) {
    extern bool out1;
    extern bool out2;
    out1 = (in.in1 == true);
    out2 = in.in2;
}

Notice: all compilers are set to x64 mode with the highest "general purpose" (= no specific processor architecture specified) "optimize for speed" setting; you can see the results for yourself/play with them at https://gcc.godbolt.org/z/K_i8h9


Clang 6 with -O3 seems to produce the most straightforward output:

foo(In&):                             # @foo(In&)
        mov     al, byte ptr [rdi]
        mov     byte ptr [rip + out1], al
        mov     al, byte ptr [rdi + 1]
        mov     byte ptr [rip + out2], al
        ret

In a standard-conformant C++ program the == true comparison is redundant, so both assignments become straight copies from one memory location to the other, passing through al as there's no memory to memory mov.

However, as there's no register pressure here, I'd have expected it to use two different registers (to completely avoid false dependency chains between the two assignments), possibly starting both reads first and doing both writes after, to help instruction-level parallelism; is this kind of optimization completely obsolete on recent CPUs thanks to register renaming and aggressive out-of-order execution? (more on this later)
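
For concreteness, this is the shape of output I had in mind - a purely hypothetical sketch, not something any of these compilers actually emits:

foo(In&):
        mov     al, byte ptr [rdi]            # start both reads first...
        mov     cl, byte ptr [rdi + 1]        # ...into two different registers
        mov     byte ptr [rip + out1], al     # ...then do both writes
        mov     byte ptr [rip + out2], cl
        ret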


GCC 8.2 with -O3 does almost the same thing, but with a twist:

foo(In&):
        movzx   eax, BYTE PTR [rdi]
        mov     BYTE PTR out1[rip], al
        movzx   eax, BYTE PTR [rdi+1]
        mov     BYTE PTR out2[rip], al
        ret

Instead of a plain mov to the "small" register, it does a movzx to the full eax. Why? Is this to completely reset the state of eax and its sub-registers in the register renamer, to avoid partial register stalls?


MSVC 19 with /O2 adds one more quirk:

in$ = 8
void foo(In & __ptr64) PROC                ; foo, COMDAT
        cmp     BYTE PTR [rcx], 1
        sete    BYTE PTR bool out1         ; out1
        movzx   eax, BYTE PTR [rcx+1]
        mov     BYTE PTR bool out2, al     ; out2
        ret     0
void foo(In & __ptr64) ENDP                ; foo

Besides the different calling convention, here the second assignment is pretty much the same.

However, the comparison in the first assignment is actually performed (interestingly, using both a cmp and a sete with memory operands, so you could say that the intermediate register is FLAGS).

  • Is this VC++ explicitly playing it safe (the programmer asked for this, maybe he knows something I don't about that bool), or is it due to some known inherent limitation - e.g. bool being treated as a plain byte with no particular properties immediately after the frontend?
  • As it's not a "real" branch (the code path is not altered by the result of the cmp) I'd expect this not to cost that much, especially compared to accessing memory. How costly is this missed optimization?

Finally, ICC 18 with -O3 is the strangest of all:

foo(In&):
        xor       eax, eax                                      #9.5
        cmp       BYTE PTR [rdi], 1                             #9.5
        mov       dl, BYTE PTR [1+rdi]                          #10.12
        sete      al                                            #9.5
        mov       BYTE PTR out1[rip], al                        #9.5
        mov       BYTE PTR out2[rip], dl                        #10.5
        ret                                                     #11.1
  • The first assignment does the comparison, exactly as in VC++ code, but the sete goes through al instead of straight to memory; is there any reason to prefer this?
  • All reads are "started" before doing anything with the results - so this kind of interleaving still actually matters?
  • Why is eax zeroed out at the start of the function? Partial register stalls again? But then dl doesn't get this treatment...

Just for fun, I tried removing the == true, and ICC now does

foo(In&):
        mov       al, BYTE PTR [rdi]                            #9.13
        mov       dl, BYTE PTR [1+rdi]                          #10.12
        mov       BYTE PTR out1[rip], al                        #9.5
        mov       BYTE PTR out2[rip], dl                        #10.5
        ret                                                     #11.1

so, no zeroing out of eax, but still using two registers and "start reading in parallel first, use all the results later".

  • What's so special about sete that makes ICC think it's worth zeroing out eax before?
  • Is ICC right after all to reorder reads/writes like this, or the apparently more sloppy approach of the other compilers currently performs the same?
Matteo Italia
  • Which register is used doesn't matter as all modern x86 CPUs use register renaming. – fuz Sep 13 '18 at 20:34
  • @fuz: but the ICC compiler seems convinced otherwise, and one would think Intel knows a thing or two about their processors (although honestly I've pretty much always found its output inferior to gcc's, at least for high-level optimizations). That's why I'm asking. – Matteo Italia Sep 13 '18 at 20:35
  • I read this https://blog.regehr.org/archives/1603 and then went for a lie down in a darkened room. – Richard Critten Sep 13 '18 at 20:35
  • @RichardCritten: heh, but it seems really interesting, I'll surely take a look! – Matteo Italia Sep 13 '18 at 20:36
  • @fuz: Only Intel before IvyBridge renames AL separately from RAX (P6 family and SnB itself, but not later SnB-family) [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/q/45660139). **On all other uarches (including Haswell/Skylake), writing AL merges into RAX** ([Why doesn't GCC use partial registers?](https://stackoverflow.com/q/41573502)). Clang doesn't seem to understand partial registers very well, and creates false deps and partial-reg penalties for no reason sometimes. – Peter Cordes Sep 13 '18 at 20:48
  • It's not going to make a difference on a deeply out of order x86, nor will reordering the memory references make a difference because the accesses cannot alias (and thus even x86's strong-ish memory model can reorder the accesses in hardware), with the possible exception of false aliasing (if the source struct is on the corresponding address within a different 4096 byte page), in which case the Intel compiler might have a point. Don't overalign your structures... – EOF Sep 13 '18 at 20:49
  • Who voted to close this as "opinion based"? *Some* people might be inclined to post wild shot-in-the-dark opinions based on guesses, but there are real reasons on actual x86 uarches for most of this. (I can see the too-broad reason, though. You'd think that one tiny function would be simple, but it's often the case that something this small can be a test-case for a whole handful of missed-optimization bug reports for a single compiler.) – Peter Cordes Sep 13 '18 at 22:15
  • @PeterCordes: I agree completely on both counts. Indeed, I was tempted to split this into four separate questions, but (1) that occurred to me only after I had already written most of the question, (2) I feared it would have led to too much repetition in the answers, as many of the themes here (such as partial register stalls and dependency chains) cut across all these examples, and (3) it would have lost part of the benefit of comparing the outputs together in a single question. OTOH I realize that many sub-questions are actually quite specific, so I don't know? – Matteo Italia Sep 13 '18 at 22:41
  • I think it works better as one question, because there's so much overlap in the reasoning behind why one is better vs. another is worse. And yeah avoiding false deps is the common problem for all of them. – Peter Cordes Sep 13 '18 at 22:55
  • I think the close votes are not justified. The answer is all facts and the question is very specific -- comparing the perf of 4 short pieces of code. – Hadi Brais Sep 13 '18 at 23:00

2 Answers


TL:DR: gcc's version is the most robust across all x86 uarches, avoiding false dependencies or extra uops. None of them are optimal; loading both bytes with one load should be even better.

The 2 key points here are:

  • The mainstream compilers only care about out-of-order x86 uarches for their default tuning for instruction selection and scheduling. All x86 uarches that are currently sold do out-of-order execution with register renaming (for full registers like RAX at least).

    No in-order uarches are still relevant for tune=generic. (Older Xeon Phi, Knights Corner, used modified Pentium P54C-based in-order cores, and in-order Atom systems might still be around, but that's obsolete now, too. In that case it would be important to do the stores after both loads, to allow memory-parallelism in the loads.)

  • 8- and 16-bit partial registers are problematic, and can lead to false dependencies. Why doesn't GCC use partial registers? explains the different behaviours for a variety of x86 uarches.


  1. partial-register renaming to avoid false dependencies:

Intel before IvyBridge renames AL separately from RAX (P6 family and SnB itself, but not later SnB-family). On all other uarches (including Haswell/Skylake, all AMD, and Silvermont / KNL), writing AL merges into RAX. For more about modern Intel (HSW and later) vs. P6-family and first-gen Sandybridge, see this Q&A: How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent.

On Haswell/Skylake, mov al, [rdi] decodes to a micro-fused ALU + load uop that merges the load result into RAX. (This is nice for bitfield merging, instead of having extra cost for the front-end to insert a later merging uop when reading the full register).

It performs identically to how add al, [rdi] or add rax, [rdi] would. (It's only an 8-bit load, but it has a dependency on the full width of the old value in RAX. Write-only instructions to low-8/low-16 regs like al or ax are not write-only as far as the microarchitecture is concerned.)

On P6 family (PPro to Nehalem) and Sandybridge (first generation of Sandybridge-family), clang's code is perfectly fine. Register-renaming makes the load/store pairs totally independent from each other, as if they'd used different architectural registers.

On all other uarches, Clang's code is potentially dangerous. If RAX was the target of some earlier cache-miss load in the caller, or some other long dependency chain, this asm would make the stores dependent on that other dep-chain, coupling them together and removing the opportunity for the CPU to find ILP.
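
For illustration, suppose the caller looked something like this (a made-up caller, not from the question), on a uarch where writing AL merges into RAX:

        mov     rax, [rbx]          # suppose this load misses in cache
        lea     rdi, [rsp + 16]     # the In object
        call    foo                 # clang's `mov al, [rdi]` inside foo merges into
                                    # the still-in-flight RAX, so both stores in foo
                                    # now have to wait for the cache miss to complete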

The loads are still independent, because the loads are separate from the merging and can happen as soon as the load address rdi is known in the out-of-order core. The store-address is also known, so the store-address uops can execute (so later loads/stores can check for overlap), but the store-data uops are stuck waiting for the merge uops. (Stores on Intel are always 2 separate uops, but they can micro-fuse together in the front-end.)

Clang doesn't seem to understand partial registers very well, and creates false deps and partial-reg penalties for no reason sometimes, even when it doesn't save any code-size by using a narrow or al,dl instead of or eax,edx, for example.

In this case it saves a byte of code size per load (movzx has a 2-byte opcode).

  2. Why does gcc use movzx eax, byte ptr [mem]?

Writing EAX zero-extends to the full RAX, so it's always write-only with no false dependency on the old value of RAX on any CPU. Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?.

movzx eax, m8/m16 is handled purely in the load ports, not as a load + ALU-zero-extend, on Intel, and on AMD since Zen. The only extra cost is 1 byte of code-size. (AMD before Zen has 1 cycle of extra latency for movzx loads, and apparently they have to run on an ALU as well as a load port. Doing sign/zero-extension or broadcast as part of a load with no extra latency is the modern way, though.)

gcc is pretty fanatical about breaking false dependencies, e.g. pxor xmm0,xmm0 before cvtsi2ss/sd xmm0, eax, because Intel's poorly-designed instruction set merges into the low qword of the destination XMM register. (Short-sighted design for PIII which stores 128-bit registers as 2 64-bit halves, so int->FP conversion instructions would have taken an extra uop on PIII to also zero the high half if Intel had designed it with future CPUs in mind.)
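
For example, for a plain int→float conversion gcc typically emits something along these lines (a sketch; exact register allocation may differ):

int_to_float(int):                  # float int_to_float(int x) { return x; }
        pxor     xmm0, xmm0         # break the dependency on xmm0's old value
        cvtsi2ss xmm0, edi          # the conversion only writes the low dword of xmm0
        ret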

The problem usually isn't within a single function; it's when these false dependencies end up creating a loop-carried dependency chain across call/ret in different functions that you can unexpectedly get a big slowdown.

For example, store-data throughput is only 1 per clock (on all current x86 uarches), so 2 loads + 2 stores already takes at least 2 clocks.

If the struct is split across a cache line boundary, though, and the first load misses but the 2nd hits, avoiding a false dep would let the 2nd store write data to the store buffer before the first cache miss is finished. This would let loads on this core read from out2 via store-forwarding. (x86's strong memory ordering rules prevent the later store from becoming globally visible by committing to the store buffer ahead of the store to out1, but store-forwarding within a core/thread still works.)


  3. cmp/setcc: MSVC / ICC are just being dumb

The one advantage here is that putting the value into ZF avoids any partial-register shenanigans, but movzx is a better way to avoid it.

I'm pretty sure MS's x64 ABI agrees with the x86-64 System V ABI that a bool in memory is guaranteed to be 0 or 1, not 0 / non-zero.

In the C++ abstract machine, x == true has to be the same as x for a bool x, so (unless an implementation used different object-representation rules in structs vs. extern bool), it can always just copy the object representation (i.e. the byte).

If an implementation was going to use a one-byte 0 / non-0 (instead of 0 / 1) object representation for bool, it would need to cmp byte ptr [rcx], 0 to implement the booleanization in (int)(x == true), but here you're assigning to another bool so it could just copy. And we know it's not booleanizing 0 / non-zero because it compared against 1. I don't think it's intentionally being defensive against invalid bool values, otherwise why wouldn't it do that for out2 = in.in2?
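
(For illustration, under that hypothetical 0 / non-0 representation, booleanizing to an int 0 / 1 would need something like the following sketch; this is not what any of these ABIs actually requires:)

        xor     eax, eax
        cmp     byte ptr [rcx], 0      # is the source bool non-zero?
        setne   al                     # eax = 0 or 1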

This just looks like a missed optimization. Compilers are not awesome at bool in general (see Boolean values as 8 bit in compilers. Are operations on them inefficient?). Some are better than others.

MSVC's setcc directly to memory is not bad, but cmp + setcc is 2 extra unnecessary ALU uops that didn't need to happen. For Ryzen, setcc m8 is 1 uop, 1/clock throughput (https://uops.info/). (Agner Fog reports one per 2 clocks for it, https://agner.org/optimize/. That's probably a typo, or maybe different measurement methodology, because automated testing/reporting by https://uops.info/ finds setcc [mem] is 1/clock throughput. On Steamroller, it's 1 uop / 1 per clock, and Zen didn't make much if anything worse than Bulldozer-family.)

On Intel, setcc m8 is 2 fused-domain uops (ALU + micro-fused store, for 3 back-end uops) and 1 per clock throughput, like you'd expect. (Or better than 1/clock on Ice Lake with its extra store port, but still worse than register.)

Note that setcc can only decode in the "complex" decoder on Intel for reasons, because setbe / seta are (still unfortunately) 2 uops.

  4. ICC's xor-zeroing before setz

I'm not sure if there's an implicit conversion to int anywhere in here in ISO C++'s abstract machine, or if == is defined for bool operands.

But anyway, if you are going to setcc into a register, it's not a bad idea to xor-zero it first for the same reason movzx eax,mem is better than mov al,mem. Even if you don't need the result zero-extended to 32-bit.

That's probably ICC's canned sequence for creating a boolean integer from a compare result.

It makes little sense to use xor-zero / cmp / setcc for the compare, but mov al, [m8] for the non-compare. The xor-zero is the direct equivalent of using a movzx load to break the false dependency here.

ICC is great at auto-vectorizing (e.g. it can auto-vectorize a search-loop like while(*ptr++ != 0){} while gcc/clang can only auto-vec loops with a trip count that's known ahead of the first iteration). But ICC is not great at little micro-optimizations like this; it often has asm output that looks more like the source (to its detriment) than gcc or clang.

  5. all reads "started" before doing anything with the results - so this kind of interleaving still actually matters?

It's not a bad thing. Memory disambiguation usually allows loads after stores to run early anyway. Modern x86 CPUs even dynamically predict when a load won't overlap with earlier unknown-address stores.

If the load and store address are exactly 4k apart, they alias on Intel CPUs, and the load is falsely detected as dependent on the store.

Moving loads ahead of stores definitely makes things easier for the CPU; do this when possible.

Also, the front-end issues uops in-order into the out-of-order part of the core, so putting the loads first can let the 2nd one start maybe a cycle earlier. There's no benefit to having the first store done right away; it will have to wait for the load result before it can execute.

Reusing the same register does reduce register pressure. GCC likes to avoid register pressure all the time, even when there isn't any, like in this not-inlined stand-alone version of the function. In my experience, gcc tends to lean towards ways of generating code that create less register pressure in the first place, rather than only reining in its register use when there is actual register pressure after inlining.

So instead of having 2 ways of doing things, gcc sometimes just only has the less-register-pressure way which it uses even when not inlining. For example, GCC used to almost always use setcc al / movzx eax,al to booleanize, but recent changes have let it use xor eax,eax / set-flags / setcc al to take the zero-extension off the critical path when there's a free register that can be zeroed ahead of whatever sets flags. (xor-zeroing also writes flags).
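
As a sketch of the two idioms just described - for something like returning (a < b) as an integer, with a and b in edi/esi; which gcc versions emit which form varies, so treat this as illustrative rather than exact output:

        # zero-extension on the critical path:
        cmp     edi, esi
        setl    al
        movzx   eax, al

        # zero-extension off the critical path (the xor must come before cmp,
        # since xor-zeroing also writes flags):
        xor     eax, eax
        cmp     edi, esi
        setl    al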


"passing through al as there's no memory to memory mov."

There is no memory-to-memory mov worth using for single-byte copies, anyway. One possible (but sub-optimal) implementation is:

foo(In &):
    mov   rsi, rdi
    lea   rdi, [rip+out1]
    movsb               # copy in.in1 to out1 (also increments rsi and rdi)
    lea   rdi, [rip+out2]
    movsb               # copy in.in2 to out2
    ret

An implementation that's probably better than anything the compilers spotted is:

foo(In &):
    movzx  eax, word ptr [rdi]      # AH:AL = in2:in1
    mov    [rip+out1], al
    mov    [rip+out2], ah
    ret

Reading AH may have an extra cycle of latency, but this is great for throughput and code-size. If you care about latency, avoid the store/reload in the first place and use registers. (By inlining this function).

If the two struct members were written separately and very recently, this 2-byte load will incur a store-forwarding stall. (Just extra latency for this store-forwarding, not actually stalling the pipeline until the store buffer drains.)

The other microarchitectural danger with this is a cache-line split on the load (if in.in2 is the first byte of a new cache line). That could take an extra 10 cycles. Or on pre-Skylake, if it's also split across a 4k boundary the penalty can be 100 cycles extra latency. But other than that, x86 has efficient unaligned loads, and it's normally a win to combine narrow loads / stores to save uops. (gcc7 and later typically do this when initializing multiple struct members even in cases where it can't know that it won't cross a cache-line boundary.)
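
For example, with a hypothetical initializer (my example; this is a guess at typical gcc7+ output, and the exact instruction may differ):

void init(In &in) {     // reusing the In struct from the question
    in.in1 = true;
    in.in2 = false;
}
// gcc7 and later typically merge the two adjacent byte stores into a single
// 16-bit store, something like:  mov  WORD PTR [rdi], 1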

The compiler should be able to prove that In &in can't alias extern bool out1, out2, because they have static storage and different types.

If you just had 2 pointers to bool, you wouldn't know (without bool *__restrict out1) that they don't point to members of the In object; in that case it wouldn't be safe to read in2 before writing out1, unless you checked for overlap first. But a static bool out2 can't alias members of a static In object.
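
A minimal C++ sketch of that pointer variant (foo_ptrs and the parameter names are my own illustration, not part of the question):

struct In { bool in1, in2; };

// Without __restrict the compiler must assume out1/out2 could point into *in,
// so it couldn't safely read in->in2 before writing *out1 without an overlap check.
void foo_ptrs(const In *in, bool *__restrict out1, bool *__restrict out2) {
    *out1 = in->in1;
    *out2 = in->in2;
}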

Peter Cordes
  • Since an uninitialized bool can behave as if it is neither true nor false, the explicit comparison to `true` is potentially different from just copying the value. Since using an uninitialized value results in Undefined Behavior, both approaches are valid. – 1201ProgramAlarm Sep 13 '18 at 22:23
  • @1201ProgramAlarm nobody says it's not conformant, just that it's stupid from a performance standpoint. – Matteo Italia Sep 13 '18 at 22:37
  • Interesting: look at the code for `bool fn(bool b) { return b == 2; }` to see which compilers assume that `bool` can only be 0 or 1. icc assumes that `bool` can have any byte value, while the latest versions of all the other compilers simply return false here. This explains why icc uses a cmp with 1. Strangely, msvc starting from version 17 assumes that bool is only 1 or 0, but still uses a cmp with 1 in the more complex original example. – RbMm Sep 14 '18 at 00:30
  • With msvc it also makes sense to compare the -O1 (smallest code) and -O2 (fastest code) optimization levels (so O1 is not less than O2, just different; I don't know what O1/O2/O3 mean in the other compilers). Say in `bool fn(bool b) { return b == 1; }`, with -O2 it uses `movzx eax, cl` while with -O1 it uses `mov al, cl` - so `movzx eax,cl` is faster than `mov al,cl`? I naively thought it was the other way around. – RbMm Sep 14 '18 at 00:37
  • @RbMm: `gcc -Os` optimizes for size more than speed. `clang -Os` is similar. `clang -Oz` optimizes aggressively for size, doing stuff like `push 2` / `pop rax` instead of `mov eax,2`. gcc/clang -O3 are full optimization for speed without concern for size (size always matters for I$ pressure and stuff like that, but -O3 only cares about size when all else is equal). – Peter Cordes Sep 14 '18 at 00:41
  • @RbMm: `movzx eax, cl` is never worse than `mov al,cl`, and can be faster. On IvyBridge and later, `movzx r32, r8` has zero latency (mov-elimination), while `mov al,cl` always needs an ALU uop. (On SnB and older, partial-register renaming avoids a false dep, but they don't have mov-elimination. On newer Intel, and on AMD, `mov al,cl` is a merge into RAX and mov-elimination can never work on it.) `movzx eax, cl` avoids a false dependency on RAX, which matters on all CPUs other than P6-family and first-gen SnB – Peter Cordes Sep 14 '18 at 00:44
  • I've run all codes on Haswell. See my answer for the results. – Hadi Brais Sep 14 '18 at 01:25
  • The answer says the false dependency should exist in both Haswell and Skylake, but we are seeing a difference in performance. – Hadi Brais Sep 14 '18 at 02:59
  • We can trivially prove there's a false dep by adding an `add eax,eax` into the loop and seeing it slow down to 3c per iter. With perfect scheduling, there's no reason to expect any slowdown in your loop. It seems Haswell doesn't do a perfect job if you add extra uops that change it from a simple loop-carried dep chain. – Peter Cordes Sep 14 '18 at 03:06
  • @PeterCordes With `add eax,eax`, the input operands depend on the last write to `al`, so it has to wait for that particular instruction `mov al, byte [rdi + 1]` from the previous iteration to complete, which seems to take about 2 cycles. This true dependency does not exist when using `mov rax, qword [rdi+64]`. Also the instruction `mov al, byte [rdi]` has to wait one cycle for the result of `add eax,eax` to correctly merge the value of the register. This is the false dependency. Both deps result in serializing `add eax,eax` from both directions. In contrast to using `mov rax, qword [rdi+64]`. – Hadi Brais Sep 14 '18 at 03:55
  • But doesn't `mov rax, qword [rdi+64]` have to wait for the previous `mov al, byte [rdi + 1]`? If that were the case, then using this instruction should make the loop run at 3c per iteration, not 2.16. – Hadi Brais Sep 14 '18 at 03:58
  • @HadiBrais: yes, exactly. `mov al, mem` / `mov al, mem` is a 2-cycle dep chain for RAX, and `add eax,eax` lengthens that dep chain by another 1 cycle. – Peter Cordes Sep 14 '18 at 03:58
  • @HadiBrais: But no, `mov rax, qword [rdi+64]` does *not* have to wait for any previous stuff involving RAX or AL, because it's a write-only access with no true or false dependency on the old value. It starts a new dep chain for that architectural register; that's the whole point of reg renaming with Tomasulo's algorithm. As you say, that's why it doesn't make it 3c. – Peter Cordes Sep 14 '18 at 04:00
  • @PeterCordes I've edited my answer accordingly. Your code is actually slightly better for the case with `add eax,eax`. – Hadi Brais Sep 14 '18 at 04:34
  • @HadiBrais: On Skylake, I can't repro any slowdown with gcc's sequence. That's weird, I have no clue why `add eax,eax` would ever slow down with `movzx` loads, because nothing depends on it and it just has to wait for the result of the 2nd `movzx` load. Nice update to your answer with a diagram of the dependencies, though. – Peter Cordes Sep 14 '18 at 04:40
  • This is the code I'm using for gcc: `add eax,eax` / `movzx eax, byte [rdi]` / `mov byte [rsi + 4], al` / `movzx eax, byte [rdi+1]` / `mov byte [rsi + 8], al` in a loop. – Hadi Brais Sep 14 '18 at 04:45
  • I hope I got all the dependencies right, because it can be a little confusing. – Hadi Brais Sep 14 '18 at 04:46
  • @HadiBrais: I checked; you did get the dep diagram correct. And yeah, that's identical to what I tested, with `dec ebp/jnz` as the loop branch. Top of loop aligned by 32 (not that alignment should matter, especially on Haswell where the LSD loop buffer works.) – Peter Cordes Sep 14 '18 at 04:49
  • Anyway, the code you've proposed at the bottom of your answer is indeed empirically optimal at least on Haswell. Congrats! – Hadi Brais Sep 14 '18 at 04:52
  • So even without `add eax,eax` or `mov rax, qword [rdi+64]`, there is still a false dependency between the two instructions that write to `al`. My understanding is that the two loads get issued and completed in perfect parallelism. Then the first one gets written back in one cycle and the next one in the next cycle. That's why it takes 2 cycles per iteration. The loads of the next iterations can also happen in parallel in the next cycles. The memory store instructions do not cause any additional delay. I'm kinda impressed. – Hadi Brais Sep 14 '18 at 05:00
  • @HadiBrais: You were testing with an aligned (or at least not cache-line-split) source. That's the only weakness of the word-load. (Well, reading AH is another potential danger on HSW/SKL; extra latency for that store-data uop I guess, so good to know that doesn't affect the pipeline). And yes, that's an accurate summary of the (false) dep chain through AL for the byte loads. Except "written back" is the wrong word; it's a merge uop that runs on the ALU just like an `add` or `or` would, combining the 64-bit old value with the new 8-bit low byte to produce a new 64-bit value. – Peter Cordes Sep 14 '18 at 05:06
  • @PeterCordes: later I'm going to re-read your answer with full attention, but I'm a bit confused by the bits about inlining: seeing the code generated after inlining (very specific to the inlining site and blended with the surroundings), I always assumed that it happens at a quite high level, and the optimizations/codegen is done *after* merging the parent function with the inlined ones. Hence, codegen for the "regular" version should be completely separated, so I wouldn't think the compiler here would have to take any precaution to generate code that is also good for inlining. Am I mistaken? – Matteo Italia Sep 14 '18 at 07:09
  • @MatteoItalia: you're right about how inlining works. That point was hard to make clearly, because I don't know gcc internals in detail. But in my experience, it tends to lean towards ways of generating code that create less register pressure in the first place, rather than only reining in its register use when there is actual register pressure after inlining. So instead of having 2 ways of doing things, it sometimes just always has the less-register-pressure way which it uses even when not inlining. – Peter Cordes Sep 14 '18 at 07:18
  • @PeterCordes: oh ok, put in this way it's way clearer and more convincing, thank you. – Matteo Italia Sep 14 '18 at 07:19
  • @MatteoItalia: Thanks for the feedback on which part was unclear, edited the answer with that 2nd attempt at an explanation from my last comment. – Peter Cordes Sep 14 '18 at 07:28

I've run all of the code sequences in a loop on Haswell. The following graph shows the execution time of each for 1 billion iterations in three cases:

  • There is a mov rax, qword [rdi+64] at the beginning of every iteration. This potentially creates a false register dependency (called dep in the graph).
  • There is a add eax, eax at the beginning of every iteration (called fulldep in the graph). This creates a loop-carried dependency and a false dependency. See also the image below for an illustration of all the true and false dependencies of add eax, eax, which also explains why it serializes execution in both directions.
  • Only the partial register dependency inherent to the code itself, with no extra instruction at the top of the iteration (called nodep in the graph, which stands for no added false dependency). So this case has one fewer instruction per iteration compared to the previous one.

In all three cases, the same memory locations are being accessed in every iteration. For example, the Clang-like code that I tested looks like this:

mov     al, byte [rdi]
mov     byte [rsi + 4], al
mov     al, byte [rdi + 1]
mov     byte [rsi + 8], al

This is placed in a loop where rdi and rsi never change. There is no memory aliasing. The results clearly show that partial register dependencies inflict a 7.5% slowdown on Clang. Peter, MSVC, and gcc are all clear winners in terms of absolute performance. Also note that for the second case, Peter's code is doing slightly better (2.02c per iteration for gcc and msvc, 2.04c for icc, but only 2.00c for Peter). Another possible metric of comparison is code size.
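
(A rough sketch of the loop structure, based on the details discussed in the comments - a dec/jnz loop counter and the loop top aligned by 32; the actual harness is linked in the comments below. rdi, rsi and ebp are assumed to be set up before the loop.)

align 32
.loop:
        mov     al, byte [rdi]          ; body under test (clang-like version shown)
        mov     byte [rsi + 4], al
        mov     al, byte [rdi + 1]
        mov     byte [rsi + 8], al
        dec     ebp                     ; ebp starts at 1 billion
        jnz     .loop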

[Graph: execution time of each code sequence for 1 billion iterations, under the nodep, dep, and fulldep cases]

[Diagram: the true and false dependencies of add eax, eax in the test loop]

Hadi Brais
  • Your names seem backwards. `nodep` is the one where you don't *break* the partial-register dependency chain, so there is a loop-carried dep chain through RAX on Haswell. `mov rax, qword [rdi+64]` is independent of the `mov al, byte [rdi + 1]` in the previous iteration, breaking the loop-carried dependency chain just like `xor eax,eax` would. I'm surprised it's slower, but I guess more uops and imperfect scheduling maybe leads to some missed store-data cycles? – Peter Cordes Sep 14 '18 at 01:35
  • @PeterCordes Yeah I've intentionally avoided loop-carried deps just out of curiosity to see what happens. Also I wanted all accesses to hit in the cache because I wanted to know whether this matters. So it seems that any instruction even with moderate latency (like L1 hit) can have a fairly serious impact. I'm not exactly sure what's happening, still thinking about it. The naming refers to "no false dependency" and "with false dependency." :) – Hadi Brais Sep 14 '18 at 01:44
  • You're saying that `mov rax, qword [rdi+64]` "*potentially* creates *a partial register dependency*", but that's backwards. `mov al, [rdi]` already has a false dependency on the old value of AL/AX/EAX/RAX on Haswell; it's a merge into RAX. Without any extra instructions, the clang loop body you show in the question *has* a loop-carried dependency chain, which `mov rax, qword [rdi+64]` *breaks* by being a write-only access to RAX (from a source that only depends on RDI), instead of a merge. – Peter Cordes Sep 14 '18 at 01:50
  • Did you try loading into a different register to see if it's just having an extra load in the mix that hurts (`mov r8, [rdi+64]`)? Or an instruction like `imul eax, edi, 1234` which is also write-only for RAX but doesn't use a load port? (vs. a lower latency instruction like `lea eax, [rdi]`). The latency of the instruction that writes RAX shouldn't really matter, as long as it doesn't have a false output dependency like `popcnt eax, edi` or `tzcnt` on Haswell. Register renaming makes it independent, and RDI is cold (always ready). – Peter Cordes Sep 14 '18 at 01:53
  • Yeah I got it backwards, thanks for pointing it out. Yes, `mov rax, qword [rdi+64]` breaks the loop-carried dependency. I'll try all of your three suggestions (load to different register, using imul, and using lea). – Hadi Brais Sep 14 '18 at 01:59
  • I just tried your clang loop with/without the extra load on Skylake, and found both run at 2.0 cycles per iteration to within ~2 parts in 1000. (2.004 billion cycles for 1G iters) with or without `mov rax, [rdi+64]` at the top of the loop. You shouldn't be seeing the effects of the loop buffer (which is disabled by microcode on SKL); I don't think front-end effects should matter so that doesn't explain the diff between SKL and HSW. It's not a super high uop-throughput rate either, bottlenecking on the merge + store. – Peter Cordes Sep 14 '18 at 02:11
  • @PeterCordes Loading to a different register, RBX, makes it run at 2.0 cycles per iteration. I think it's clear that the perf degradation is because of writing to RAX. My measurements are very consistent with tiny standard deviation. – Hadi Brais Sep 14 '18 at 02:27
  • @PeterCordes Using `imul` instead of the load, it runs at 2.46 cycles per iteration. Using `lea` instead, it runs at 2.37 cycles per iteration, both worse than using load. – Hadi Brais Sep 14 '18 at 02:32
  • Interesting. Sounds like it's a real effect on Haswell but not Skylake. I did try `imul eax, edi, 1235` on SKL and got 2.005 c/i, so again no effect. – Peter Cordes Sep 14 '18 at 02:35
  • @PeterCordes My understanding is that the pipeline waits for `mov rax, qword [rdi+64]` to write the results into `rax` before writing the value from `mov al, byte [rdi]` into `al`. This is perhaps required to merge the two values correctly in the register file itself. Only then the value of `al` can be loaded for the next instruction. This is a form of false register dependency between `al` and `rax`, right? Same interpretation when using `imul` or `lea`, but different register writeback latencies. I thought this issue should exist on Skylake as well, but apparently not. – Hadi Brais Sep 14 '18 at 02:40
  • Right, the dep-chain situation is that each `mov rax, [mem]` starts a new (short) dep chain for RAX, consisting of the load-merges. (And the stores fork off from that, reading the results of the load-merges). Register-renaming means that each write to RAX is to a different physical register, so a load into RAX can write-back in the same cycle as an earlier merge into AL. There is no separate retirement register file (and even on Nehalem where there is, it would only be updated based on the last ROB entry to retire, so only one write-back to the architectural RAX would be needed). – Peter Cordes Sep 14 '18 at 02:50
  • The false dependency is that `mov al, [rdi]` depends on the old value of RAX, because it's a merge instead of renaming just AL separately from the high bytes of RAX like SnB and P6-family do. Think of it like `add al, [rdi]` or `add rax, [rdi]` as far as how the dependency works. – Peter Cordes Sep 14 '18 at 02:52
  • @PeterCordes I guess `al` could not be picked up from the bypass network because it can only be sent to the bypass network when it is also safe to write it to the physical register that is mapped to `rax`. Also, the instruction's result cannot be held up at the end of the execution unit. Once the result is available, I guess it has to go on the bypass network; there is no delaying that. Therefore, the dispatch itself has to be delayed. I'm just guessing here. I'm not sure what is happening exactly. – Hadi Brais Sep 14 '18 at 02:57
  • microarchitecturally, the CPU is probably not sending around 8-bit AL over the bypass network; it's probably sending at least the 32-bit EAX merge result, if not the whole RAX. (As a power optimization, it might gate off the high bits to actually internally do the zero-extension implicitly. Anyway, I think we know byte-store uops can take just the low byte of a full register from the forwarding network.) My suspicion is some kind of sub-optimal scheduling where maybe ALU uops like imul are causing resource stalls in the merge uops, and we end up with a bubble on the store-data port. – Peter Cordes Sep 14 '18 at 03:03
  • This is really interesting, especially as it shows how small these penalties are (on the order of 2% at most), except for the naive clang code. Naively, I would have expected a more serious slowdown for the `cmp`/`sete` version, but it seems like, to a first approximation, the only thing that counts is avoiding a full dependency (BTW the diagram is really nice!). Would you mind providing the full test harness/how exactly you measured these numbers? Now I'm curious about testing this on several machines of different generations/market segments. I wonder if I still have some Atom lying around... – Matteo Italia Sep 14 '18 at 06:07
  • @MatteoItalia Here you go [https://github.com/hadibrais/BenchNASM](https://github.com/hadibrais/BenchNASM). It'd be nice if you share your results with us. If you want, I can make graphs. – Hadi Brais Sep 14 '18 at 17:03
  • @MatteoItalia Note that it's sufficient to record the number of core cycles as reported by `perf`, even if you are running the code on different processors with dynamically changing core frequencies. That's because most memory accesses hit in the L1 cache, so everything is in terms of core cycles. But the range of frequencies should be similar. – Hadi Brais Sep 14 '18 at 17:08
  • Just chiming in to say that I didn't forget about this Q&A, it's just that I've been a little busy and I'd like to finish my tests a bit more systematically before calling it a day. Still, I can anticipate that, at least in the "full dependency" scenario, almost all machines I tested behave pretty much in the same way, with minor differences as to whether @PeterCordes' solution or the gcc one is actually better. – Matteo Italia Sep 18 '18 at 06:00
  • @MatteoItalia It'd be interesting to know if there are processors on which gcc's code performs (even slightly) better than Peter's. – Hadi Brais Sep 18 '18 at 06:20