This is probably a matter of nano-optimization rather than micro-optimization, but the subject interests me: are there any penalties for using non-native register sizes in long mode?

I've learned from various sources that partial register updates (like ax instead of eax) can cause eflags stall and degrade performance. But I'm not sure about long mode. What register size is considered native for this processor operating mode? x86-64 is still an extension of the x86 architecture, thus I believe 32 bits are still native. Or am I wrong?

For example, instructions like

    sub eax, r14d

or

    sub rax, r14

have the same size, but is there a penalty for using either of them? May there be any penalties when mixing register sizes in consecutive instructions like the ones below? (assuming the high dword is zero in all cases)

    sub ecx, eax
    sub r14, rax
Peter Cordes
Alexander Zhak
  • There are penalties for 16-bit accesses. Use of 32-bit registers and avoiding r8-r15 is OK and in fact often leads to smaller code size. – Iwillnotexist Idonotexist Oct 19 '16 at 21:34
    Writing to the 32 bit register will automatically clear the top 32 bits, so avoids the partial update problem. – Jester Oct 19 '16 at 21:53
  • The EFLAGS register is heavily virtualized in modern processors, like all registers are. Necessarily so: too many instructions modify it, and that is a major damper on super-scalar execution. What is missing from your code is an instruction that actually *uses* the register, so there is no compelling reason for the processor to interlock it and stall the code you posted. Never ever take somebody's opinion of how it should/could work. The only point of writing assembly code is to make it faster than a C compiler would. Measure. – Hans Passant Oct 19 '16 at 23:03
    @HansPassant: You can't easily test for yourself on every microarchitecture you care about. This is why stuff like Agner Fog's guide is so valuable. If you tested on a Haswell CPU, you'd find no penalties for partial-flags or partial-register usage. You'd also find no partial-reg penalties on AMD CPUs of any vintage, or P4, or Silvermont. But there are serious penalties on older Intel CPUs, including Core2 and Nehalem. (I've tested myself on Core2 and Sandybridge, but not others; I'm taking Agner Fog's word for that, since you raise the subject of taking people's word for things :) – Peter Cordes Oct 20 '16 at 06:59
  • Correction: Haswell still has merge uops for AH, like Sandybridge. (But [IvB and later don't rename AL or AX separately from RAX at all](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to), according to my testing) – Peter Cordes Aug 14 '17 at 18:34

1 Answer

May there be any penalties when mixing 32 and 64-bit register sizes in consecutive instructions?

No, writing to a 32-bit register always zero-extends to the full 64-bit register, so x86-64 avoids any partial-register penalties for 32 and 64-bit instructions.
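
As a minimal sketch of that rule (the register values are illustrative assumptions, not taken from the question), compare what a 16-bit and a 32-bit write do to the rest of RAX:

```asm
mov   rax, -1      # RAX = 0xFFFFFFFFFFFFFFFF
mov   ax, 7        # 16-bit write merges:       RAX = 0xFFFFFFFFFFFF0007
mov   eax, 5       # 32-bit write zero-extends: RAX = 0x0000000000000005
sub   ecx, eax     # 32-bit result, zero-extended into RCX
sub   r14, rax     # 64-bit op reads all of RAX; nothing to merge
```

Only the 16-bit write has to preserve old upper bits, which is where the merging cost or false dependency comes from. The 32-bit write overwrites the whole register, so a later 64-bit read has nothing to wait for.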

thus I believe 32 bits are still native.

Yes, the default operand-size is 32-bit for most instructions (other than PUSH/POP). 64-bit needs a REX prefix with the W bit set to 1. So prefer 32-bit for code-size reasons. This is why compilers use mov r32, imm32 for addresses of static data (since the default code-model requires that code and static data addresses are in the low 2GiB of virtual address space).
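
To illustrate the code-size point, here is a sketch of how a few `sub` variants can be encoded. The byte sequences use the `29 /r` form of the opcode; an assembler is free to pick the equivalent `2B /r` form, so treat the exact bytes as one valid encoding rather than what every assembler emits:

```asm
sub   eax, ecx     # 29 C8       2 bytes: default 32-bit operand-size, no prefix
sub   rax, rcx     # 48 29 C8    3 bytes: REX.W selects 64-bit operand-size
sub   eax, r14d    # 44 29 F0    3 bytes: REX.R is needed to reach r14d
sub   rax, r14     # 4C 29 F0    3 bytes: REX.W and REX.R share one prefix byte
```

This is why the two instructions in the question have the same size: using r14 already costs a REX prefix, so setting the W bit in it is free. With only the legacy registers, the 32-bit form is a byte shorter.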

It was a design choice by AMD. They could have chosen the other way, and required a prefix to get 32-bit operand size. Since long mode is a separate mode, x86-64 machine code can be different from x86-32 machine code however it wants. AMD chose to minimize the differences so they could share as many transistors as possible in the decoders. Your conclusion is correct, but your reasoning is totally bogus.


partial register updates (like ax instead of eax) can cause eflags stall and degrade performance.

Partial-flag stalls are separate from partial-register stalls. They're handled similarly internally (the separately-renamed parts of EFLAGS have to be merged the same as a modified AX has to be merged with the unmodified upper bytes of EAX). But one doesn't cause the other.

    # partial-reg stall
    setcc   al           # leaves the upper 3 (or 7) bytes unmodified
    add     edx, eax     # reads full EAX.  Older CPUs stall while merging

Zeroing EAX ahead of the flag-setting and setcc with xor eax,eax avoids the partial-register penalty entirely. (Core2/Nehalem stalls for fewer cycles than earlier CPUs, but does still stall for 2 or 3c while inserting a merging uop. Sandybridge doesn't stall at all while inserting the merging uop).
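
A sketch of that xor-zeroing pattern, assuming a hypothetical signed compare of edi against esi as the flag-setting instruction:

```asm
xor   eax, eax     # zero all of RAX first; xor writes flags, so it can't go after the cmp
cmp   edi, esi     # flag-setting instruction (hypothetical operands)
setl  al           # writes AL; the upper bytes are already known to be zero
add   edx, eax     # reads full EAX without the partial-register penalty
```

The ordering matters: because `xor` itself modifies flags, the zeroing has to happen before the compare, not between the compare and the `setcc`.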

(Another summary of partial register penalties on different CPUs: Why doesn't GCC use partial registers?, saying basically the same thing).

AMD doesn't suffer from partial-register stalls when reading the full register later, but instead partial-register writes and reads have a false dependency on the full register. (AMD CPUs don't rename sub-registers separately in the first place. Intel P4 and Silvermont / Knight's Landing are the same way.)

Intel Haswell/Skylake (and maybe Ivybridge) don't rename al separately from rax at all, so they never need to merge low8 / low16 registers. But the setcc al has a false dependency on the old value. They do still rename and merge ah. (Details on HSW/SKL partial-reg performance.)


    # partial flag stall when reading a flag that didn't come from
    # the last instruction to write any flags.
    clc
    # edi and esi = one-past-the-end of dst and src
    # ecx = -count
    bigInt_add:
        mov   eax, [esi+ecx*4]
        adc   [edi+ecx*4], eax   # reads CF, partial flag stall on 2nd and later iterations
        inc   ecx                # writes all flags except CF
        jl    bigInt_add         # loop upwards towards zero

See this Q&A for more discussion about partial-flags issues on Intel pre-Sandybridge vs. Sandybridge.


See also Agner Fog's microarch pdf, and other links in the tag wiki for more details about all of this.

Peter Cordes