
The introductory x86 asm literature I read just seems to stick with the 32-bit registers (eax, ebx, etc.) in all practical scenarios, except to demonstrate the 64-bit registers as a thing that also exists. If the 16-bit registers are mentioned at all, it is as a historical note explaining why the 32-bit registers have an 'e' in front of their names. Compilers seem equally uninterested in smaller-than-32-bit registers.

Consider the following C code:

int main(void) { return 511; }

Although main purports to return an int, Linux exit status codes are in fact 8-bit, so any value over 255 is truncated to its least significant 8 bits, viz.

hc027@HC027:~$ echo "int main(void) { return 511; }" > exit_gcc.c
hc027@HC027:~$ gcc exit_gcc.c 
hc027@HC027:~$ ./a.out 
hc027@HC027:~$ echo $?
255

So we see that only the least significant 8 bits of int main(void)'s return value are used by the system. Yet when we ask GCC for the assembly output of that same program, will it store the return value in an 8-bit register? Let's find out!

hc027@HC027:~$ cat exit_gcc.s
    .file   "exit_gcc.c"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $511, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609"
    .section    .note.GNU-stack,"",@progbits

Nope! It uses %eax, a very-much-32-bit register! Now, GCC is smarter than me, and maybe the return value of int main(void) gets used for other stuff I don't know about, where it won't be truncated to the 8 least significant bits (or maybe the C standard decrees that it must return a for-realsy, actual int no matter what its actual destiny).

But regardless of the efficacy of my specific example, the question stands. As far as I can tell, the registers under 32 bits are pretty much neglected by modern x86 assembly programmers and compilers alike. A cursory Google of "when to use 16-bit registers x86" returns no relevant answers. I'm pretty curious: is there any advantage to using the 8- and 16-bit registers on x86 CPUs?

Alex V
  • Write a program that uses a `uint8_t` type. Does this change the registers used? The domain of types generally does not change depending on what "is" used. – user2864740 Nov 30 '18 at 18:12
  • Also don't forget to enable optimization, especially `-Os`, as using 8-bit registers produces smaller code. – Jester Nov 30 '18 at 18:13
  • So, when I change the type to `uint8_t`, GCC reacts by just pushing the value onto the stack and eschewing general registers altogether. *However*, when I use `uint8_t` **and** `-Os`, GCC makes use of the 8-bit `AL` register!! – Alex V Nov 30 '18 at 18:19
  • When you write to a register smaller than a DWORD (32-bit), you run the risk of the processor performing poorly because of a [potential register stall](https://stackoverflow.com/a/41574531/3857942) afterwards. Since `-O3` is generally optimizing for speed, it will try to avoid the partial register stall situation. When using `-Os` it will optimize for size: the encoding of `mov $51, %al` (2 bytes) is shorter than `mov $51, %eax` (5 bytes), so it was chosen instead. A shorter instruction doesn't necessarily mean better-performing code. – Michael Petch Nov 30 '18 at 18:55
  • @MichaelPetch I had no idea about partial register stalls, thanks! So, in summary, the reason *to* use partial registers is to marginally optimize for size, and the reason *not* to use them is the partial register stall. Excellent! That answers my question, if you want to add it as an answer. Thanks! – Alex V Nov 30 '18 at 19:01
  • Some (very specific) calculations may be a better fit for 16- or 8-bit registers; in such a case, and if many iterations are happening, it may be better to still use those instead of 32-bit regs (while carefully crafting the code not to stall). But in the last ~two decades x86 has had SIMD extensions (MMX was out before 2000), so a huge part of those specific types of calculation can be accelerated even more by processing packed 8/16-bit values, making the remaining uses of the original ones so specific that I can't even recall an example right now. – Ped7g Nov 30 '18 at 22:17
  • @Ped7g Thanks for the info! I've also since learned that even when optimizing for speed, GCC will often *read* 8/16-bit registers, because apparently a partial register stall does not occur when merely reading from a register. Very interesting. – Alex V Nov 30 '18 at 22:28

2 Answers


So, it doesn't really have to be that way; there's a bit of history going on here. Try running

    mov rax, -1 ; 0xFFFFFFFFFFFFFFFF
    mov eax, 0
    print rax

On your favorite x86-64 desktop (print being based on your environment/language/whatever). What you'll notice is that even though rax started out as all ones, and you'd think you only wiped out the bottom 32 bits, the print statement prints zero! Writes to eax zero-extend into rax, completely wiping the upper 32 bits. Why? That's awfully weird and unintuitive behavior. The reason is simple: because it's much faster. Trying to maintain the upper half of rax is an absolute pain when you keep writing to eax.

Intel/AMD, however, didn't realize this back when they originally decided to move to 32-bit, and made a fatal error that forever left al/ah as nothing but historical relics: when you write to al or ah, the other doesn't get clobbered! This does make more intuitive sense, and it was once a great idea in the 16-bit era, because now you have twice as many registers, and you still have a full 32-bit register! But nowadays, with the move to an abundance of registers, we just don't need more registers anymore. What we really want are faster registers, and to push more GHz. From this point of view, every time you write to al or ah, the processor needs to preserve the other half, which is fundamentally just much more expensive. (Explanation of why, later.)

Enough with the theory; let's run some real tests. Each test case was run three times, on an Intel Core i5-4278U CPU @ 2.60GHz.

Only rax: 1.067s, 1.072s, 1.097s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov rax, 5
mov rax, 5
mov rax, 6
mov rax, 6
mov rax, 7
mov rax, 7
mov rax, 8
mov rax, 8
dec ecx
jmp loop
exit:
ret

Only eax: 1.072s, 1.062s, 1.060s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov eax, 5
mov eax, 5
mov eax, 6
mov eax, 6
mov eax, 7
mov eax, 7
mov eax, 8
mov eax, 8
dec ecx
jmp loop
exit:
ret

Only ah: 2.702s, 2.748s, 2.704s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov ah, 5
mov ah, 6
mov ah, 6
mov ah, 7
mov ah, 7
mov ah, 8
mov ah, 8
dec ecx
jmp loop
exit:
ret

Only ah/al: 1.432s, 1.457s, 1.427s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
mov ah, 6
mov al, 6
mov ah, 7
mov al, 7
mov ah, 8
mov al, 8
dec ecx
jmp loop
exit:
ret

ah and al, then eax: 1.117s, 1.084s, 1.082s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
mov eax, 6
mov al, 6
mov ah, 7
mov eax, 7
mov ah, 8
mov al, 8
dec ecx
jmp loop
exit:
ret

(Note that these tests don't involve partial register stalls, as I'm not reading eax after writes to ah; see the comments on the question.)

As you can see from the tests, using al/ah is much slower. Using eax/rax blows the other times out of the water, and there is fundamentally no performance difference between rax and eax themselves. As discussed, the reason is that eax/rax writes overwrite the entire register, whereas using ah or al means the other part needs to be maintained.


Now, if you wish, we can delve into why it's more efficient to just wipe the register on every write. At face value, it doesn't seem like it should matter: just update the bits that matter, right? What's the big deal?

Well, modern CPUs are intelligent: they will very aggressively parallelize operations that they know can't interfere with each other, but only when such parallelization is actually possible. For example, if you mov eax to ebx, then ebx to ecx, then ecx to edx, the CPU cannot parallelize that chain, and it will run slower than usual. However, if you write to eax, write to ebx, write to ecx, and write to edx, then the CPU can parallelize all of those operations, and it will run much faster than usual! Feel free to test this on your own.

Internally, the way this is implemented is by immediately starting to execute an instruction, even while earlier instructions are still in the midst of being executed. However, the primary restriction is the following:

  • If an earlier instruction writes to some register A, and the current instruction reads from register A, then the current instruction must wait until the earlier instruction has been completed in its entirety, which is what causes these kinds of slowdowns.

In our mov eax, 5 spam test, which took ~1 second, the CPU could aggressively run all of the operations in parallel, because none of the instructions read anything anyway; they were all write-only. It only needs to ensure that the most recent write is the value the register holds during any future reads (which is easy, because even though the operations all occur in overlapping time periods, the one that started last will also finish last).

In the mov ah, 5 spam test, it was a painful 2.7x slower than the mov eax, 5 spam test, because there's fundamentally no easy way to parallelize the operations. Each operation is marked as "reading from eax", since it depends on the previous value of eax, and it's also marked as "writing to eax", because it modifies the value of eax. If an operation must read from eax, it must occur after the previous operation has finished writing to eax. Thus, parallelization suffers dramatically.

Also, if you want to try on your own, you'll notice that add eax, 5 spamming and add ah, 5 spamming both take exactly the same amount of time (2.7s on my CPU, exactly the same as mov ah, 5!). In this case, add eax, 5 is marked both "read from eax" and "write to eax", so it receives exactly the same slowdown as mov ah, 5, which must also both read and write eax! The actual mov vs. add doesn't matter; the logic gates connect the input to the output via the desired operation in a single tick of the ALU.

So, I hope that shows why eax's full-width overwrite behavior leads to times that are faster than ah's preservation system.


There are a couple more details here, though: why did the ah/al swap test take a much faster 1.43 seconds? Most likely what's happening is that register renaming is helping with all of the "mov ah, 5; mov al, 5" writes. It looks like the CPU was intelligent enough to split "ah" and "al" into their own full physical registers, since they use different parts of the "eax" register anyway. This allows each consecutive pair of ah-then-al operations to run in parallel, saving significant time. If "eax" is ever read in its entirety, the CPU needs to coalesce the two "al"/"ah" registers back into one, causing a significant slowdown (shown later). In the earlier "mov ah, 5"-only test, it wasn't possible to split eax into separate registers, because we used "ah" every single time anyway.

And, interestingly, if you look at the ah/al/eax test, you can see that it was almost as fast as the eax test! In this case, I'm predicting that all three got their own registers and the code was thus extremely parallelized.

Of course, as mentioned, attempting to read eax anywhere in that loop is going to kill performance, since ah/al then have to be coalesced. Here's an example:

Times: 3.412s, 3.390s, 3.515s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
xor eax, 5
mov al, 6
mov ah, 8
xor eax, 5
mov al, 8
dec ecx
jmp loop
exit:
ret

But note that the above test doesn't have a proper control group, as it uses xor instead of mov (e.g., what if merely using "xor" is the reason it's slow?). So, here's a test to compare it to:

Times: 1.426s, 1.424s, 1.392s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
xor ah, 5
mov al, 6
mov ah, 8
xor ah, 5
mov al, 8
dec ecx
jmp loop
exit:
ret

The first of these two tests (with xor eax) coalesces very aggressively, which causes the horrible 3.4 seconds, in fact far slower than any of the other tests. But the second (with xor ah) still splits al/ah into two different registers and thus runs pretty fast, faster than only using ah, because consecutive ah/al operations can be parallelized. So that was a trade-off that Intel was willing to make.

As mentioned, and as seen, it just doesn't really matter whether you use xor vs. add vs. mov; the ah/al test above still takes 1.4 seconds. Bitwise ops, add, and mov all simply hook the input directly up to the output with very few logic gates; it just doesn't matter which operation you use. (However, mul and div will indeed be slower, as they require tougher computation and thus several micro-cycles.)


The past two tests show the partial register stall reported in the comments, which to be honest I hadn't even considered at first. I first thought register renaming would help mitigate the problem, which it appears to do in the ah/al mixes and ah/al/eax mixes. However, reads of eax with dirty ah/al values are brutal, because the processor now has to combine the ah/al registers. It looks like processor manufacturers believed renaming partial registers was still worth it, though, which makes sense: most work with ah/al doesn't involve reads of eax; you would just read from ah/al if that was your plan. This way, tight loops that bit-fiddle with ah/al benefit greatly, and the only harm is a hiccup on the next use of eax (at which point ah/al are probably not going to be used anymore).

If Intel wanted, rather than the ah/al register-renaming optimization giving 1.4 seconds, normal ah taking 2.7 seconds, and register-coalescing abuse taking 3.4 seconds, Intel could have skipped partial-register renaming entirely and all of those tests would have been the exact same 2.7 seconds. But Intel is smart: they know there's code out there that wants to use ah and al a lot, but it's not common to find code that uses al and ah a lot while also reading the full eax all the time.

Overall, even in the case of no partial register stall, writes to ah are still much slower than writes to eax, which is what I was trying to get across.

Of course, results may vary. Other processors (most likely very old ones) might have control bits to shut off half of the bus, which would allow it to act as a 16-bit or 8-bit bus when needed. Those control bits would have to be wired via logic gates along the input to the registers, which would slightly slow down any and all usage of the register, since that's one more gate to go through before the register can update. Since such control bits would be off the vast majority of the time (it's rare to mess with 8-bit/16-bit values), it looks like Intel decided not to do that (for good reason).

Nicholas Pipitone
  • Note that I'm not saying compilers should always use both AH and AL on old CPUs; writes to those registers probably couldn't pair with each other on P5. If you can get the same thing done in the same number of instructions using DWORD registers, it's pretty much never worse on any CPU, and often at least slightly better on some. So overall your recommendation to use DWORD registers when possible and avoid AH is good. But your reasoning is bogus, as are your claims about AH being inherently slow on *all* CPUs. – Peter Cordes Dec 03 '18 at 20:39
  • @PeterCordes Hey man, I didn't say it's slow on all CPUs. I already put a note that results may vary, and posted my CPU so that others may compare. I already mentioned that this can be optimized, and won't necessarily be true. Obviously, tests on an Intel prove nothing about an AMD. I've never said you're wrong; you've brought up many correct statements. All I do state is that, even with no register renaming whatsoever, it is definitely harder for the CPU to use `ah` as opposed to `eax`. The CPU can either let it be slower, or dedicate transistors and logic to fix the problem. – Nicholas Pipitone Dec 03 '18 at 21:53
  • I explained why `xor eax, 5` is slow, exactly as you've commented just now. I don't understand what you're trying to disprove here. Did I say something wrong? I said it was partial register stall. Is it not partial register stall? It reads eax, and has to merge the out-of-date ah. This is partial register stall, so I don't know what you're trying to say is inaccurate there. To quote from the link mentioned earlier, "Partial register stall is a problem that occurs when we write to part of a 32-bit register and later read from the whole register or a bigger part of it.". – Nicholas Pipitone Dec 03 '18 at 21:58
  • I was forgetting exactly what your answer actually said, sorry. Only a couple things are actually wrong: "so that left a decade where eax was definitely faster" is not correct: on in-order CPUs without register renaming, writing to AL was exactly equivalent to writing to EAX. For a RMW operation like `add eax, 5` that has to read EAX anyway, `add al, 5` is exactly the same cost on Haswell and later, and on AMD. So byte regs aren't always slower, they're just pretty much never faster. There are some cases where you can save instructions by using them, though, for a net speed gain. – Peter Cordes Dec 03 '18 at 22:07
  • @PeterCordes Yes, RMW are just as fast for ah and eax, I never said otherwise. What I specifically mentioned in the second paragraph of my answer were writes. Not reads. The issue is, that writing to "ah" requires an unnecessary read. That's all I said. – Nicholas Pipitone Dec 03 '18 at 22:13
  • You can see that it's precisely a read because my CPU executes "add ah, 5" and "add eax, 5" in 2.7s (2.673s, 2.786s, 2.804s) & (2.678s, 2.661s, 2.691s) resp. This is almost exactly the time to execute "mov ah, 5". While it's possible that by luck the renaming procedure takes the same time as a read, I find this unlikely. The times are simply too close, and we can probably assume here that they're doing exactly the same thing. Additionally, the ah/al interleave should take just as long if renaming took a significant percentage of the time. The time must be spent on reads. – Nicholas Pipitone Dec 03 '18 at 22:17
  • In later paragraphs you made more blanket statements like "*So now we see that pre register renaming, al/ah was slow, and post register renaming...., it's still slow!*", which isn't really true. Without OoO exec, there's nearly no downside to partial regs, because the "read" is free. BTW, you wondered about the renamer running out of physical registers. In some cases that can be the limiting factor for out-of-order exec's reordering window, rather than ROB (reorder buffer) size. http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ – Peter Cordes Dec 03 '18 at 22:18
  • To quote from your other answer, "On Haswell and Skylake, everything I've tested so far supports this model: AL is never renamed separately from RAX... So if you never touch the high8 registers (AH/BH/CH/DH), everything behaves exactly like on a CPU with no partial-reg renaming (e.g. AMD)." However, my tests show that using "al" instead of "ah" take times: (2.651s, 2.688s, 2.721s). I believe it is reading from EAX here. Or, at the minimum, that statement is false (Since "mov eax, 5" is much faster than "mov al, 5"). I'm not being accusatory there, I'm simply stating what I'm able to benchmark. – Nicholas Pipitone Dec 03 '18 at 22:18
  • Yeah, OoO exec could definitely be a tighter bottleneck than reordering. I referenced the "heavy load on the OoO executor" early on, since this would appear to be the most significant issue. – Nicholas Pipitone Dec 03 '18 at 22:21
  • re: `add ah,5` vs. `add eax,5` vs. `add al,5`: yes, each of those has 1 cycle latency so they all perform the same (assuming no other ops involving RAX that could require partial-reg merging if there had been partial-reg renaming). In the `add ah,5` case, the RMW dependency chain is on the separately-renamed AH. In the AL and EAX cases, it's on EAX. Notice that mixing `add eax,5` with `add al,5` doesn't introduce any extra merging stalls, just the normal dependency-chain. Same for `mov al,5`, because that's a merge on Haswell/Skylake, not split off into a separate dependency chain like P6. – Peter Cordes Dec 03 '18 at 22:21
  • I was curious so I checked Agner Fog's guide about optimizing for P5 Pentium (in-order dual-issue superscalar). Like I thought, partial registers count exactly the same as whole registers, and an instruction that writes a register can't pair with an instruction that reads or writes the same register. So `mov ah, 5` / `mov cl, al` can't pair for the same reason that `mov eax, 1` / `mov ecx, eax`can't pair. https://notendur.hi.is/hh/kennsla/sti/h96/pentopt.txt is an old version with just P5 stuff, or see the P5 section of Agner's microarch pdf (https://agner.org/optimize/). – Peter Cordes Dec 03 '18 at 23:02
  • Other than that very localized (adjacent instruction) pairing issue, there's no penalty for writing or reading partial registers on P5 Pentium, which is the point I was trying to make earlier. The CPU can't exploit instruction-level parallelism between independent instructions outside of adjacent pairs because it doesn't do out-of-order execution. (That's a bit of an overstatement for instructions with more than 1 cycle latency, like loads. But using low8 regs doesn't create new problems on P5 that you could avoid with `movzx` to a wider reg. Only with OoO exec / reg renaming.) – Peter Cordes Dec 03 '18 at 23:09
  • @PeterCordes Ah, I see. Yes I'll remove the comments about register renaming. I originally was talking about the possibility of renaming ah/al from eax to enhance speed, as a way to prevent the issues I discussed. I was not referring to ah being slower on a processor without OoO execution, not realizing that Pentium P5 did not have OoO execution (Since I thought that was a central aspect of Pentium, given the dozen versions of Pentium that have it). I was referring to ah being slower on a processor that isn't renaming ah. – Nicholas Pipitone Dec 03 '18 at 23:53
  • Too much confusion between "register renaming" in reference to renaming "ah", vs "register renaming" as equivalent to "OoO execution" (Equivalent meaning processors having both or having neither), along with me generalizing "Pentium". In many cases I was referring to the former and you were referring to the latter, but in these cases I was definitely using the bad terminology there as "register renaming" is pretty clear. – Nicholas Pipitone Dec 04 '18 at 00:01
  • Oh right, Intel kept Pentium as a brand name long after the [P5 *microarchitecture*](https://en.wikipedia.org/wiki/P5_(microarchitecture)). That's why I always say P5 Pentium (as in i586), to distinguish from Pentium/Celeron branded Haswell/Skylake chips. (Modern CPUs cheaper than i3). Pentium II / III, and Pentium 4 are all derived from Pentium Pro (P6). – Peter Cordes Dec 04 '18 at 00:39
  • The right terminology for renaming AH separately from AL (or from the whole of RAX) is *partial register renaming*. Normal register-renaming (Tomasulo's algorithm) is essential to avoid write-after-write and write-after-read hazards for out-of-order execution. https://en.wikipedia.org/wiki/Register_renaming. And see the top of [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?](https://stackoverflow.com/a/45114487) for an example of where it breaks what would otherwise be a loop-carried dependency. – Peter Cordes Dec 04 '18 at 00:42
  • Fun fact: some early CPUs had limited OoO exec without register renaming, just scoreboarding. https://en.wikipedia.org/wiki/Scoreboarding. The terms are in no way equivalent. – Peter Cordes Dec 04 '18 at 00:48
  • Argh, someone deleted the early comments with links to [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/q/45660139) :( You might want to edit that link into your answer, because it explains your benchmark results. – Peter Cordes Dec 04 '18 at 01:00

There are two practical uses of int8_t and uint8_t. They save memory, which is important not because a mainstream computer will run out, but because more of your data fits into your CPU's cache. And you also sometimes need to specify your layout in memory exactly, such as for a device driver or a packet header.

The instructions themselves are no faster (as Nicholas Pipitone’s wonderful answer shows) and might need more or fewer bytes to encode. In a few circumstances, you might be able to improve your register allocation.

Davislor