22

Disassembling write(1,"hi",3) on Linux, built with gcc -s -nostdlib -nostartfiles -O3, results in:

ba03000000     mov edx, 3 ; thanks for the correction jester!
bf01000000     mov edi, 1
31c0           xor eax, eax
e9d8ffffff     jmp loc.imp.write

I'm not into compiler development, but since every value moved into these registers is constant and known at compile time, I'm curious why gcc doesn't use dl, dil, and al instead. Some may argue that this feature won't make any difference in performance, but there's a big difference in executable size between mov $1, %rax => b801000000 and mov $1, %al => b001 when we are talking about thousands of register accesses in a program. Not only is small size part of a software's elegance, it also has an effect on performance.

Can someone explain why "GCC decided" that it doesn't matter?

Peter Cordes
Ábrahám Endre
  • If you only load partial registers, the rest will contain random garbage and the callee will use the whole register (as appropriate for the data type). It also causes partial-register stalls. Note that writing the low 32 bits will, however, zero the top 32 bits automatically. PS: you disassembled wrong, all of those instructions are actually 32-bit (no REX prefix). – Jester Jan 10 '17 at 16:28
  • It doesn't have anything to do with GCC, every C compiler is required to do this. Google "C integer promotion" to learn more. – Hans Passant Jan 10 '17 at 16:58
  • 1
    @HansPassant Does integer promotion work for function arguments of prototyped functions? As far as I can tell from the standard, only the [default argument promotions](https://stackoverflow.com/questions/1255775/default-argument-promotions-in-c-function-calls) apply for function calls. Quoting: "*The integer promotions are applied only: as part of the usual arithmetic conversions, to certain argument expressions [ndr: the default arg promotions of above], to the operands of the unary +, -, and ~ operators, and to both operands of the shift operators, as specified by their respective subclauses*" – Margaret Bloom Jan 10 '17 at 17:06
  • 1
    @MargaretBloom The value passed as an argument is converted, as if by assignment, to the type of the argument. See paragraph 7. Either way this means that the constants `3` and `1`, which are already `signed int`, remain as `signed int`. – Ross Ridge Jan 10 '17 at 17:51
  • @RossRidge Yes, but does assignment perform integer promotion? From my understanding, the answer seems to be no. – Margaret Bloom Jan 10 '17 at 17:55
  • @MargaretBloom No, but I don't think quibbling about whether Hans Passant was using the correct terminology is all that helpful here. I should correct my last comment though: the constant `3` will either remain as `signed int` or be converted to `size_t`, depending on whether there's a prototype. – Ross Ridge Jan 10 '17 at 18:02
  • @RossRidge Of course, that was just total nitpicking in this context. I've asked because I stumbled on [this weird case](https://godbolt.org/g/CYsYIo) and I'm considering if it's worth asking about. – Margaret Bloom Jan 10 '17 at 18:07
  • 7
    @MargaretBloom For what it's worth, the `xor eax, eax` indicates that the call was made without a prototype in scope. It doesn't know whether the function is varargs or not, so it sets AL to 0 to indicate 0 arguments passed in SSE registers. Your weird case is really an ABI question, the "as if" rule allows either implementation so long as both ends agree on it. – Ross Ridge Jan 10 '17 at 18:17
  • @RossRidge That's a good point. – Margaret Bloom Jan 10 '17 at 18:20

3 Answers

36

Partial registers entail a performance penalty on many x86 processors because, when written, they are renamed into physical registers different from their whole-register counterparts. (For more about register renaming enabling out-of-order execution, see this Q&A).

But when an instruction reads the whole register, the CPU has to detect the fact that it doesn't have the correct architectural register value available in a single physical register. (This happens in the issue/rename stage, as the CPU prepares to send the uop into the out-of-order scheduler.)

It's called a partial register stall. Agner Fog's microarchitecture manual explains it pretty well:

6.8 Partial register stalls (PPro/PII/PIII and early Pentium-M)

Partial register stall is a problem that occurs when we write to part of a 32-bit register and later read from the whole register or a bigger part of it.
Example:

; Example 6.10a. Partial register stall
mov al, byte ptr [mem8]
mov ebx, eax ; Partial register stall

This gives a delay of 5 - 6 clocks. The reason is that a temporary register has been assigned to AL to make it independent of AH. The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX.

Behaviour in different CPUs:

Without partial-register renaming, the input dependency for the write is a false dependency if you never read the full register. This limits instruction-level parallelism, because reusing an 8- or 16-bit register for something else is not actually independent from the CPU's point of view (16-bit code can access 32-bit registers, so it has to maintain correct values in the upper halves). It also makes AL and AH not independent of each other. When Intel designed the P6 family (PPro released in 1995), 16-bit code was still common, so partial-register renaming was an important feature for making existing machine code run faster. (In practice, many binaries don't get recompiled for new CPUs.)

That's why compilers mostly avoid writing partial registers. They use movzx / movsx whenever possible to zero- or sign-extend narrow values to a full register to avoid partial-register false dependencies (AMD) or stalls (Intel P6-family). Thus most modern machine code doesn't benefit much from partial-register renaming, which is why recent Intel CPUs are simplifying their partial-register renaming logic.
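
As a rough illustration (a minimal sketch; the function name is made up and the exact output depends on the gcc version and options), loading a narrow value from memory typically gets a movzx into the full register rather than a byte write that would merge into the old value:

#include <stdint.h>

/* gcc -O2 typically emits
     movzx eax, byte ptr [rdi]
     ret
   rather than  mov al, [rdi], which would merge into the old value of rax
   and create a partial-register stall (Intel P6) or false dependency (AMD). */
uint32_t load_byte(const uint8_t *p) {
    return *p;
}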

As @BeeOnRope's answer points out, compilers still read partial registers, because that's not a problem. (Reading AH/BH/CH/DH can add an extra cycle of latency on Haswell/Skylake, though, see the earlier link about partial registers on recent members of Sandybridge-family.)
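
For instance, here is a minimal sketch (function name invented) where the compiler reads a byte register without ever writing one:

/* Storing the low 8 bits of x: gcc -O2 typically emits
     mov byte ptr [rdi], sil
     ret
   i.e. it reads the partial register sil, which is cheap,
   but it never writes a partial register. */
void store_low_byte(char *dst, int x) {
    *dst = (char)x;
}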


Also note that write takes arguments that, for a typically configured x86-64 GCC, need whole 32-bit and 64-bit registers, so the call couldn't simply be assembled with mov dl, 3. The size is determined by the type of the data, not by its value.

Finally, in certain contexts C has default argument promotions to be aware of, though that is not what is happening here.
Actually, as RossRidge pointed out, the call was probably made without a visible prototype.
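
For reference, when the prototype is in scope the argument types alone dictate the register widths. A minimal sketch of the mapping under the SysV x86-64 calling convention (not necessarily the OP's exact build):

#include <unistd.h>

/* ssize_t write(int fd, const void *buf, size_t count);   POSIX prototype
   SysV x86-64 argument registers for this call:
     fd    (int,          32 bits) -> edi
     buf   (const void *, 64 bits) -> rsi
     count (size_t,       64 bits) -> rdx  (mov edx, 3 is enough, because
                                            writing edx zeroes the upper
                                            half of rdx)                   */
int main(void) {
    write(1, "hi", 3);
    return 0;
}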


Your disassembly is misleading, as @Jester pointed out.
For example, mov rdx, 3 is actually mov edx, 3, although both have the same effect, namely putting 3 into the whole rdx.
This is because an immediate value of 3 doesn't require sign-extension and a MOV r32, imm32 implicitly clears the upper 32 bits of the register.
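
A small illustration of the compiler exploiting this implicit zeroing (again, the exact output varies by version and the function name is made up):

/* gcc -O2 typically emits
     mov eax, 3    ; 5 bytes, the upper 32 bits of rax are cleared implicitly
     ret
   instead of the longer  mov rax, 3  (7 bytes, with a REX.W prefix). */
unsigned long three(void) {
    return 3;
}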

Peter Cordes
Margaret Bloom
  • "a temporary register has been assigned to AL to make it independent of AH". Why assign a "new" register for it, AL isn't physically a "subset" of EAX? Why does it have to be separated from EAX? – Ábrahám Endre Jan 10 '17 at 17:31
  • To make more instructions execute in parallel. It's called [register renaming](https://en.wikipedia.org/wiki/Register_renaming), the Agner Fog manual I linked has more in-depth material than the Wikipedia article. Intel Optimisation manuals also cover this topic. – Margaret Bloom Jan 10 '17 at 17:39
  • The above quotation from Agner Fog is for Netburst (Pentium 4). The quoted delay of 5 - 6 clocks is much better on later microarchitectures. For example from Sandy Bridge and Ivy Bridge, *The Ivy Bridge inserts an extra μop only in the case where a high 8-bit register (AH, BH, CH, DH) has been modified* – Olsonist Mar 16 '17 at 05:59
  • Thanks @Olsonist, it is worth mentioning. If you don't mind I'll quote your comment in the answer. – Margaret Bloom Mar 16 '17 at 10:54
  • 1
    Yes, the simple answer has nothing to do with register stalls (it is easy to find examples of compilers reading and writing the partial registers even at `-O2` and `-O3`), but that the x86 and x86-64 ABIs _require_ that arguments smaller than 32 bits be zero- or sign-extended when passing them to functions. Regardless of what the prototype is, and even if there is no prototype in scope (the ABI still applies then, with some "default" function signature based on the shape of the call). Interestingly, the high 32 bits [can](http://stackoverflow.com/q/40475902) contain garbage, but not bits 8 to 32. – BeeOnRope Mar 17 '17 at 23:12
  • @BeeOnRope What an interesting question you asked! I had the very same question a bit ago, when I posted [this comment](https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers/41574531?noredirect=1#comment70355715_41573502) above. – Margaret Bloom Mar 18 '17 at 00:24
  • Oh yes, I hadn't clicked on it! I wrote an answer below that covers some of the nuances, but it links to two other questions that cover it in a lot of detail. What I just discovered while playing around with this question, though, is that apparently `icc` doesn't respect the defacto rules, so actually `icc` and `clang` are incompatible on Linux and OSX too (I guess `icc` exists on OSX)? The short version is that while both `clang` and `gcc` respect the "zero extend rule", only `clang` _relies_ on it, as your example shows. `gcc` zero extends, but doesn't use that fact to optimize... – BeeOnRope Mar 18 '17 at 00:45
  • 1
    BTW, my claim above is a bit too strong; I wrote it before I did the remaining research and found out that it is a bit of a weaker "defacto" standard than I thought, with `icc` not adhering and some debate about what the right thing to do is. The key to the OP's question though is that `gcc` does follow the extension-to-32-bits rule. – BeeOnRope Mar 18 '17 at 00:47
  • I fixed several things that were wrong (the quote is for early P6, not Netburst), and added a table of how different CPUs handle partial registers. Anyway, I wanted to link to a good answer about partial registers in general, and now this is one, IMO :) – Peter Cordes Dec 02 '17 at 05:25
  • @PeterCordes Thank you very much Peter! I really appreciate it – Margaret Bloom Dec 02 '17 at 12:31
5

In fact, gcc very often uses partial registers. If you look at generated code, you'll find lots of cases where partial registers are used.

The short answer for your particular case is that gcc always sign- or zero-extends arguments to 32 bits when calling a C ABI function.

The de-facto SysV x86 and x86-64 ABI adopted by gcc and clang requires that parameters smaller than 32 bits be zero- or sign-extended to 32 bits. Interestingly, they don't need to be extended all the way to 64 bits.

So for a function like the following on a 64-bit SysV ABI platform:

void foo(short s) {
 ...
}

... the argument s is passed in rdi and the bits of s will be as follows (but see my caveat below regarding icc):

  bits 0-31:  SSSSSSSS SSSSSSSS SPPPPPPP PPPPPPPP
  bits 32-63: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  where:
  P: the bottom 15 bits of the value of `s`
  S: the sign bit of `s` (extended into bits 16-31)
  X: arbitrary garbage

The code for foo can depend on the S and P bits, but not on the X bits, which may be anything.

Similarly, for foo_unsigned(unsigned short u), you'd have 0 in bits 16-31, but it would otherwise be identical.
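
Here is a sketch of what the extension looks like from the caller's side (function names are invented; typical gcc -O2 output is shown in the comments, though the exact instructions may vary):

void foo(short s);
void foo_unsigned(unsigned short u);

/* The caller extends the 16-bit value to 32 bits before the (tail) call.
   gcc -O2 typically emits:  movsx edi, di   then   jmp foo            */
void call_signed(int x) {
    foo((short)x);
}

/* And for the unsigned case:  movzx edi, di   then   jmp foo_unsigned */
void call_unsigned(int x) {
    foo_unsigned((unsigned short)x);
}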

Note that I said defacto - because it actually isn't really documented what to do for smaller return types, but you can see Peter's answer here for details. I also asked a related question here.

After some further testing, I concluded that icc actually breaks this defacto standard. gcc and clang seem to adhere to it, but gcc only in a conservative way: when calling a function, it does zero/sign-extend arguments to 32 bits, but in its own function implementations it doesn't depend on the caller having done so. clang implements functions that depend on the caller extending the parameters to 32 bits. So in fact clang and icc are mutually incompatible, even for plain C functions, if they have any parameters smaller than int.
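
To make that difference concrete, here is a rough sketch (the function name is invented, and the exact instruction selection depends on the compiler version, but it matches the behaviour described above):

/* Callee side of  int widen(short s):
     gcc   typically re-extends the argument itself:  movsx eax, di  /  add eax, 1
     clang typically trusts the caller's extension:   lea eax, [rdi+1]
   Both extend when calling, but only clang's callee relies on it, which is
   why mixing with a compiler that doesn't extend (icc, per the above) can
   fail for plain C functions with narrow parameters. */
int widen(short s) {
    return s + 1;
}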

BeeOnRope
1

On something like the original IBM PC, if AH was known to contain 0 and it was necessary to load AX with a value like 0x34, using "MOV AL,34h" would generally take 8 cycles rather than the 12 required for "MOV AX,0034h"--a pretty big speed improvement (either instruction could execute in 2 cycles if pre-fetched, but in practice the 8088 spends most of its time waiting for instructions to be fetched at a cost of four cycles per byte). On the processors used in today's general-purpose computers, however, the time required to fetch code is generally not a significant factor in overall execution speed, and code size is normally not a particular concern.

Further, processor vendors try to maximize the performance of the kinds of code people are likely to run, and 8-bit load instructions aren't likely to be used nearly as often nowadays as 32-bit load instructions. Processor cores often include logic to execute multiple 32-bit or 64-bit instructions simultaneously, but may not include logic to execute an 8-bit operation simultaneously with anything else. Consequently, while using 8-bit operations when possible was a useful optimization on the 8088, it can actually be a significant performance drain on newer processors.

supercat