Why are clang and GCC not using xchg to implement std::swap?

Question

I have the following code:

#include <utility>  // std::swap

char swap(char reg, char* mem) {
    std::swap(reg, *mem);
    return reg;
}

I expected this to compile down to:

swap(char, char*):
    xchg    dil, byte ptr [rsi]
    mov     al, dil
    ret

But what it actually compiles to is (at -O3 -march=haswell -std=c++20):

swap(char, char*):
    mov     al, byte ptr [rsi]
    mov     byte ptr [rsi], dil
    ret

See here for a live demo.

From the documentation of xchg, the first form should be perfectly possible:

XCHG - Exchange Register/Memory with Register

Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general-purpose registers or a register and a memory location.

So is there any particular reason why it's not possible for the compiler to use xchg here? I have tried other examples too, such as swapping pointers, swapping three operands, swapping types other than char but I never get an xchg in the compile output. How come?

score 7 · Accepted Answer · answered Sep 02 '20 at 13:52

TL:DR: because compilers optimize for speed, not for names that sound similar. There are lots of other terrible ways they also could have implemented it, but chose not to.

xchg with mem has an implicit lock prefix (on 386 and later) so it's horribly slow. You always want to avoid it unless you need an atomic exchange, or are optimizing completely for code-size without caring at all for performance, in cases where you do want the result in the same register as the original value. It's sometimes seen in naive (performance-oblivious) hand-written Bubble Sort as part of swapping 2 memory locations.
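For contrast, here is a minimal sketch of the case where that implicit lock is exactly what you want: an atomic exchange. The names atomic_swap and atomic_swap_demo are made up for illustration. On x86-64, GCC and clang typically compile the exchange() call to an xchg with a memory operand, but check your own compiler's output rather than taking that as guaranteed.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically store `reg` into `*mem` and return the
// value that was there before. exchange() defaults to seq_cst ordering.
char atomic_swap(char reg, std::atomic<char>* mem) {
    return mem->exchange(reg);
}

// Hypothetical self-check showing the observable behaviour.
void atomic_swap_demo() {
    std::atomic<char> slot{'a'};
    char old = atomic_swap('b', &slot);
    assert(old == 'a');          // old memory contents come back
    assert(slot.load() == 'b');  // new value is now in memory
}
```

The non-atomic std::swap in the question has no such ordering requirement, so paying for a locked instruction there would buy nothing.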

Possibly clang -Oz could go that crazy, IDK, but hopefully wouldn't in this case because your xchg way is larger code size, needing a REX prefix on both instructions to access DIL, vs. the 2-mov way being a 2-byte and a 3-byte instruction. clang -Oz does do stuff like push 1 / pop rax instead of mov eax, 1 to save 2 bytes of code size.
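To make the size comparison concrete, here is a sketch that spells out one valid encoding of each sequence as bytes. The array names are hypothetical and the bytes are hand-assembled for illustration; an assembler may pick a different but equally long encoding for the register-to-register mov.

```cpp
// Machine-code bytes for the sequence the question expected.
constexpr unsigned char xchg_version[] = {
    0x40, 0x86, 0x3E,  // xchg dil, byte ptr [rsi]  (REX + opcode + ModRM)
    0x40, 0x88, 0xF8,  // mov  al, dil              (REX + opcode + ModRM)
    0xC3,              // ret
};

// Machine-code bytes for the sequence clang actually emits.
constexpr unsigned char mov_version[] = {
    0x8A, 0x06,        // mov al, byte ptr [rsi]    (opcode + ModRM, no REX)
    0x40, 0x88, 0x3E,  // mov byte ptr [rsi], dil   (REX + opcode + ModRM)
    0xC3,              // ret
};

// Including the ret: 7 bytes for the xchg way vs. 6 for the two-mov way.
static_assert(sizeof(xchg_version) == 7, "xchg sequence is 3 + 3 + 1 bytes");
static_assert(sizeof(mov_version) == 6, "mov sequence is 2 + 3 + 1 bytes");
```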

GCC -Os won't use xchg for swaps that don't need to be atomic because -Os still cares some about speed.

Also, IDK why you would think xchg + dependent mov would be faster or a better choice than two independent mov instructions that can run in parallel. (The store buffer makes sure that the store is correctly ordered after the load, regardless of which uop finds its execution port free first.)
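Written out by hand, the swap in the question is just one load and one store, and neither needs the other's result. A minimal sketch (swap_by_hand and its demo are hypothetical names, not anything from the standard library):

```cpp
#include <cassert>

// Same observable behaviour as the question's swap(), without std::swap.
char swap_by_hand(char reg, char* mem) {
    char old = *mem;  // load: does not depend on the store below
    *mem = reg;       // store: does not need the loaded value
    return old;       // the caller gets the previous memory contents
}

// Hypothetical self-check.
void swap_by_hand_demo() {
    char c = 'a';
    char old = swap_by_hand('b', &c);
    assert(old == 'a');
    assert(c == 'b');
}
```

This maps directly onto the two mov instructions in the compiler output: the load into the return register and the store from the argument register.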

See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info

Seriously, I just don't see any plausible reason why you'd think a compiler might want to use xchg, especially given that the calling convention doesn't pass an arg in RAX so you still need 2 instructions. Even for registers, xchg reg,reg on Intel CPUs is 3 uops, and they're ALU uops that can't benefit from mov-elimination. (Some AMD CPUs have 2-uop xchg reg,reg. See: Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?)

I also guess you're looking at clang output; GCC will avoid partial register shenanigans (like false dependencies) by using a movzx eax, byte ptr [rsi] load even though the return value is only the low byte. Zero-extending loads are cheaper than merging into the old value of RAX. So that's another downside to xchg.

score 4 · Answer 2 · answered Sep 02 '20 at 13:51

So is there any particular reason why it's not possible for the compiler to use xchg here?

Because mov is faster than xchg and compilers optimize for speed.

See:

Why are clang and GCC not using xchg to implement std::swap?

XCHG - Exchange Register/Memory with Register
