12

Sometimes gcc uses a 32-bit register when I would expect it to use a 64-bit register. For example, the following C code:

unsigned long long 
div(unsigned long long a, unsigned long long b){
    return a/b;
}

is compiled with the -O2 option to (leaving out some boilerplate):

div:
    movq    %rdi, %rax
    xorl    %edx, %edx
    divq    %rsi
    ret

For the unsigned division, the register %rdx needs to be 0. This can be achieved by means of xorq %rdx, %rdx, but xorl %edx, %edx seems to have the same effect.

At least on my machine, there was no performance gain (i.e. speedup) from xorl over xorq.

I actually have more than one question:

  1. Why does gcc prefer the 32-bit version?
  2. Why does gcc stop at xorl and not use xorw?
  3. Are there machines for which xorl is faster than xorq?
  4. Should one always prefer 32-bit registers/operations over 64-bit registers/operations when possible?
Peter Cordes
ead
  • If you `objdump -d` the created object file, you'll see that `xorq` requires an extra byte of encoding. See the x86 programmer's manual for details. – EOF Jul 11 '16 at 09:27
  • It is just an optimization. Code size (which arguably is performance too: more stuff in the pipe, more stuff in the cache). x86 started off 16-bit, then got 32-bit extensions, then 64-bit. Some of these instructions, depending on your tools, may use the same opcode in 32- or 64-bit mode. Sometimes it is just the disassembler misleading you; sometimes it is really a smaller register that is zero-extended or sign-extended or whatever. Just read the x86 docs. – old_timer Jul 11 '16 at 19:10
  • Also related: [64 bit assembly, when to use smaller size registers](//stackoverflow.com/q/6577458) – Peter Cordes Dec 27 '19 at 11:11

2 Answers

17

Why does gcc prefer the 32-bit version?

Mainly code size: no REX prefix needed in the machine-code encoding.
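
For a concrete look at the size difference, here is roughly what a disassembler such as objdump -d shows for the two zeroing forms:

    xorl    %edx, %edx      # 31 d2      (2 bytes, no prefix needed)
    xorq    %rdx, %rdx      # 48 31 d2   (3 bytes, REX.W prefix selects 64-bit operand size)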

Why does gcc stop at xorl and not use xorw?

Writing an 8- or 16-bit partial register doesn't zero-extend into the rest of the register. (Only writing a 32-bit register implicitly zero-extends to 64 bits.)

Besides, xorw requires an operand-size prefix to encode, so it's the same size as xorq and larger than xorl. 32-bit operand size is the default in x86-64 machine code, so no prefix is required. (That holds for most instructions; a few, like push/pop and call/jmp, default to 64-bit, including the memory-indirect call [rdi] = ff 17, which loads the target pointer from memory.) 8-bit operand size uses separate opcodes rather than prefixes, but still potentially has partial-register penalties.
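
As a quick sketch of those zero-extension rules (assume the instructions run in order; the comments show the resulting value of rax after each one):

    movq    $-1, %rax       # rax = 0xFFFFFFFFFFFFFFFF
    movb    $0,  %al        # rax = 0xFFFFFFFFFFFFFF00  (bits 63:8 unchanged)
    movw    $0,  %ax        # rax = 0xFFFFFFFFFFFF0000  (bits 63:16 unchanged)
    movl    $0,  %eax       # rax = 0x0000000000000000  (a 32-bit write zero-extends to 64)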

See also Why doesn't GCC use partial registers? 32-bit registers are not considered partial registers, because writing them always writes the whole 64-bit register. (And it's writing partial regs that's the main problem, not reading them after a full-width write.)

Are there machines for which xorl is faster than xorq?

Yes, Silvermont / KNL only recognize xor-zeroing as a zeroing idiom (dependency breaking, and other good stuff) with 32-bit operand size. Thus, even though code-size is the same, xor %r10d, %r10d is much better than xor %r10, %r10. (xor needs a REX prefix for r10 regardless of operand-size).
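
For reference, the encodings really are the same length here, since r10 needs a REX prefix either way (bytes as a disassembler shows them):

    xorl    %r10d, %r10d    # 45 31 d2   (REX.RB, 3 bytes)
    xorq    %r10,  %r10     # 4d 31 d2   (REX.WRB, 3 bytes)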

On all CPUs, code size always potentially matters for decode and I-cache footprint (except when a later .p2align directive would just emit more padding because the preceding code is smaller; see footnote 1). There's no downside to using 32-bit operand size for xor-zeroing, or to implicit zero-extension in general instead of explicit (footnote 2), including using AVX vpxor xmm0,xmm0,xmm0 to zero AVX512 zmm0.

Most instructions are the same speed for all operand sizes, because modern x86 CPUs can afford the transistor budget for wide ALUs. Exceptions: imul r64,r64 is slower than imul r32,r32 on AMD CPUs before Ryzen and on Intel Atom, and 64-bit div is significantly slower on all CPUs. AMD pre-Ryzen has slower popcnt r64. Atom/Silvermont have slow shld/shrd r64 vs. r32. Mainstream Intel (Skylake etc.) has slower bswap r64.
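
The div case is the one in the question. If the values were known to fit in 32 bits, a 32-bit version of the same function (call it div32, a hypothetical variant with unsigned instead of unsigned long long) would typically compile with gcc -O2 to something like this, using the much cheaper divl:

div32:
    movl    %edi, %eax
    xorl    %edx, %edx      # same zeroing idiom as in the question
    divl    %esi            # 32-bit divide, much faster than divq
    ret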


Should one always prefer 32-bit registers/operations over 64-bit registers/operations when possible?

Yes, prefer 32-bit ops for code-size reasons at least, but note that using r8..r15 anywhere in an instruction (including in an addressing mode) also requires a REX prefix. So if you have data you can use 32-bit operand size with (or pointers to 8/16/32-bit data), prefer to keep it in the low 8 named registers (eax..edi / rax..rdi) rather than in the high 8 numbered registers (r8..r15).
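
To make that concrete, here is a hypothetical pair of equivalent adds; only the register choice differs, and the encodings (as a disassembler shows them) differ by one REX byte:

    addl    %eax, %ecx      # 01 c1      (2 bytes)
    addl    %r8d, %ecx      # 44 01 c1   (3 bytes, REX.R prefix just to name r8d)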

But don't spend extra instructions to make this happen; saving a few bytes of code-size is usually the least important consideration. e.g. just use r8d instead of saving/restoring rbx so you can use ebx if you need an extra register that doesn't have to be call-preserved. Using 32-bit r8d instead of 64-bit r8 won't help with code-size, but it can be faster for some operations on some CPUs (see above).

The same applies when you only care about the low 16 bits of a register: it can still be more efficient to use a 32-bit add instead of a 16-bit one.
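
A sketch of why: the 16-bit form needs the 0x66 operand-size prefix and writes a partial register, while the 32-bit form is shorter and writes the full register:

    addw    %si,  %di       # 66 01 f7   (3 bytes, writes only di, a partial register)
    addl    %esi, %edi      # 01 f7      (2 bytes, writes all of rdi via zero-extension)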

See also http://agner.org/optimize/ and the x86 tag wiki.


Footnote 1: There are rare use-cases for making instructions longer than necessary (What methods can be used to efficiently extend instruction length on modern x86?)

  • To align a later branch target without needing a NOP.

  • Tuning for the front-end of a specific microarchitecture (i.e. optimizing decode by controlling where instructions boundaries are). Inserting NOPs would cost extra front-end bandwidth and completely defeat the whole purpose.

Assemblers won't do this for you, and doing it by hand is time consuming to re-do every time you change anything (and you may have to use .byte directives to manually encode the instruction).

Footnote 2: I've found one exception to the rule that implicit zero-extension is at least as cheap as a wider operation: Haswell/Skylake AVX 128-bit loads being read by a 256-bit instruction have an extra 1c of store-forwarding latency vs. being consumed by a 128-bit instruction. (Details in a thread on Agner Fog's blog forum.)

Peter Cordes
  • Just for clarity: the REX prefix is a prefix of the instruction and not of the registers? – ead Jul 12 '16 at 06:11
  • @ead: yes. See the Intel insn ref manual for details of the insn encoding. – Peter Cordes Jul 12 '16 at 06:48
  • I spent some time benchmarking arithmetic-intensive code using 16-bit operands compared to 32-bit operands on various x86 architectures, and that operand-size prefix makes a *surprising* amount of difference. The 16-bit specialization was on the order of 50-100% slower than simply sign-extending the 16-bit value to 32 bits, using 32-bit instructions, and truncating the result. This is true from Pentium III all the way to Sandy Bridge. I was quite surprised, and I sort of wonder why compilers still bother emitting 16-bit instructions. I haven't found a case where they're faster. – Cody Gray Jul 13 '16 at 10:50
  • I'm intrigued by your claim here that "using larger instructions instead of padding with a NOP is typically more efficient". I've never heard that piece of wisdom anywhere before. Is that something you discovered by testing, or is it documented somewhere? And any ideas on why that might be true? Is it just that the decoder is not optimized for the various NOP encodings, compared to more frequently used instructions? – Cody Gray Jul 13 '16 at 10:54
  • @CodyGray: Agner Fog suggests using longer encoding for alignment in his Optimizing Assembly guide. It only applies in cases where the NOPs would be executed, like aligning the top of a loop that you enter by falling into it, rather than jumping into an entry point mid-loop. NOPs still take a slot in the decoders and as fused-domain uops in the uop cache and issue stage. They don't take an execution unit, but that's often not the bottleneck in modern CPUs with boatloads of execution units. – Peter Cordes Jul 13 '16 at 19:30
  • re: 16bit code: 16bit immediate operands are horrible for the decoders in Intel CPUs, which matters a lot with no uop cache. Partial-register stalls are also a big problem if you're not careful. I'd consider it a missed-optimization any time a compiler uses a 16 bit operand-size when it could have got the same result with the same number of uops without 16 bit ops. Occasionally saving a byte of code-size in an immediate operand isn't worth the perf downside for some CPUs. – Peter Cordes Jul 13 '16 at 19:34
15

In 64-bit mode, writing to a 32-bit register zeros the upper 32 bits, so xorl %edx, %edx zeros the upper half of rdx for "free".

On the other hand, xor %rdx, %rdx is encoded with an extra byte because it needs a REX prefix. When you want to zero a 64-bit register, it is a clear win to xor it as a 32-bit register.

ead
CALL-151