So, it doesn't really have to be that way; there's a bit of history going on here. Try running
mov rax, -1 ; 0xFFFFFFFFFFFFFFFF
mov eax, 0
print rax
On your favorite x86-64 desktop (with print implemented via your environment/language/whatever). What you'll notice is that even though rax started out as all ones, and you'd think you only wiped out the bottom 32 bits, the print statement prints zero! Writes to eax completely wipe rax. Why? That's awfully weird and unintuitive behavior. The reason is simple: because it's much faster. Trying to maintain the upper half of rax is an absolute pain when you keep writing to eax.
Intel/AMD, however, didn't realize this back when they originally decided to move to 32 bits, and made a fatal error that forever left al/ah as nothing but historical relics: when you write to al or ah, the other doesn't get clobbered! This does make more intuitive sense, and it was once a great idea in the 16-bit era, because you effectively had twice as many registers, plus the full 16-bit register. But nowadays, with the move to an abundance of registers, we just don't need more registers anymore. What we really want are faster registers, and to push more GHz. From this point of view, every time you write to al or ah, the processor needs to preserve the other half, which is fundamentally just much more expensive. (Explanation of why comes later.)
Enough with the theory, let's get some real tests. Each test case was run three times, on an Intel Core i5-4278U CPU @ 2.60GHz.
Only rax: 1.067s, 1.072s, 1.097s
global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov rax, 5
mov rax, 5
mov rax, 6
mov rax, 6
mov rax, 7
mov rax, 7
mov rax, 8
mov rax, 8
dec ecx
jmp loop
exit:
ret
Only eax: 1.072s, 1.062s, 1.060s
global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov eax, 5
mov eax, 5
mov eax, 6
mov eax, 6
mov eax, 7
mov eax, 7
mov eax, 8
mov eax, 8
dec ecx
jmp loop
exit:
ret
Only ah: 2.702s, 2.748s, 2.704s
global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov ah, 5
mov ah, 6
mov ah, 6
mov ah, 7
mov ah, 7
mov ah, 8
mov ah, 8
dec ecx
jmp loop
exit:
ret
Only ah/al: 1.432s, 1.457s, 1.427s
global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
mov ah, 6
mov al, 6
mov ah, 7
mov al, 7
mov ah, 8
mov al, 8
dec ecx
jmp loop
exit:
ret
ah and al, then eax: 1.117s, 1.084s, 1.082s
global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
mov eax, 6
mov al, 6
mov ah, 7
mov eax, 7
mov ah, 8
mov al, 8
dec ecx
jmp loop
exit:
ret
(Note that these tests don't have to do with the partial register stall, as I'm never reading eax after the writes to ah. This is in reference to the comments on the main post.)
As you can see from the tests, using al/ah is much slower. Using eax/rax blows the other times out of the water, and there is fundamentally no performance difference between rax and eax themselves. As discussed, the reason is that eax/rax writes directly overwrite the entire register, whereas using ah or al means that the other half needs to be maintained.
Now, if you wish, we can delve into the explanation of why it's more efficient to just wipe the register on every write. At face value, it doesn't seem like it should matter; just update only the bits that matter, right? What's the big deal?
Well, modern CPUs are intelligent: they will very aggressively parallelize operations that they know can't interfere with each other, but only when such parallelization is actually possible. For example, if you mov eax to ebx, then ebx to ecx, then ecx to edx, the CPU cannot parallelize those movs, and they will run slower than usual. However, if you write to eax, write to ebx, write to ecx, and write to edx, the CPU can parallelize all of those operations, and they will run much faster than usual! Feel free to test this on your own.
Internally, the way this is implemented is by immediately starting to execute an instruction, even if earlier instructions are still in the midst of being executed. However, the primary restriction is the following:
- If an earlier instruction writes to some register A, and the current instruction reads from register A, then the current instruction must wait until the earlier instruction has been completed in its entirety. This is what causes these kinds of slowdowns.
In our mov eax, 5 spam test, which took ~1 second, the CPU could aggressively run all of the operations in parallel, because none of the instructions read from anything; they were all write-only. It only needs to ensure that the most recent write is the value the register holds during any future reads (which is easy, because even though the operations all occur in overlapping time periods, the one that started last will also finish last).
In the mov ah, 5 spam test, it was a painful 2.7x slower than the mov eax, 5 spam test, because there's fundamentally no easy way to parallelize the operations. Each operation is marked as "reading from eax", since it depends on the previous value of eax, and it's also marked as "writing to eax", because it modifies the value of eax. If an operation must read from eax, it must occur after the previous operation has finished writing to eax. Thus, parallelization suffers dramatically.
Also, if you want to try this on your own, you'll notice that add eax, 5 spamming and add ah, 5 spamming both take exactly the same amount of time (2.7s on my CPU, exactly the same as mov ah, 5!). In this case, add eax, 5 is marked as "read from eax" and as "write to eax", so it receives exactly the same slowdown as mov ah, 5, which must also both read and write eax! The actual mov vs add doesn't matter; the logic gates will immediately connect the input to the output via the desired operation in a single tick of the ALU.
So, I hope that shows why eax's full-register overwrite behavior leads to times that are faster than ah's preservation system.
There are a couple more details here, though. Why did the ah/al swap test take a much faster 1.43 seconds? Well, most likely what's happening is that register renaming is helping with all of the "mov ah, 5; mov al, 5" writes. It looks like the CPU was intelligent enough to split "ah" and "al" into their own full registers, since they use different parts of the "eax" register anyway. This allows each consecutive pair of ah and al operations to be made in parallel, saving significant time. If "eax" is ever read in its entirety, the CPU would need to coalesce the separate "ah" and "al" registers back into one register, causing a significant slowdown (shown later). In the earlier "mov ah, 5"-only test, splitting eax into separate registers couldn't help, because we used "ah" every single time anyway.
And, interestingly, if you look at the ah/al/eax test, you can see that it was almost as fast as the eax test! In this case, I'm predicting that all three got their own registers and the code was thus extremely parallelized.
Of course, as mentioned, attempting to read eax anywhere in that loop is going to kill performance, since ah/al will have to be coalesced. Here's an example:
Times: 3.412s, 3.390s, 3.515s
global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
xor eax, 5
mov al, 6
mov ah, 8
xor eax, 5
mov al, 8
dec ecx
jmp loop
exit:
ret
But note that the above test doesn't have a proper control group, as it uses xor instead of mov (e.g., what if just using "xor" is the reason why it's slow?). So, here's a test to compare it to:
Times: 1.426s, 1.424s, 1.392s
global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
xor ah, 5
mov al, 6
mov ah, 8
xor ah, 5
mov al, 8
dec ecx
jmp loop
exit:
ret
The first of the two tests above coalesces very aggressively, which causes the horrible 3.4 seconds, in fact far slower than any of the other tests. But the second test keeps al/ah split into two different registers and thus runs pretty fast, faster than only using ah, because consecutive ah/al operations can be parallelized. So, that was a trade-off that Intel was willing to make.
As mentioned, and as seen, it just doesn't really matter whether you use xor vs add vs mov; the ah/al version above still takes 1.4 seconds. Bitwise ops, add, and mov all simply hook the input directly up to the output with very few logic gates; it just doesn't matter which operation you use. (However, mul and div will indeed be slower; they require tougher computation and thus several micro-cycles.)
The past two tests show the reported partial register stall, which to be honest I hadn't even considered at first. I initially thought register renaming would help mitigate the problem, which it appears to do in the ah/al and ah/al/eax mixes. However, reads of eax with dirty ah/al values are brutal, because the processor now has to combine the ah/al registers. It looks like processor manufacturers believed renaming partial registers was still worth it, though, which makes sense: most work with ah/al doesn't involve reads of eax; you would just read from ah/al if that was your plan. This way, tight loops that bit-fiddle with ah/al benefit greatly, and the only harm is a hiccup on the next use of eax (at which point ah/al are probably not going to be used anymore).
If Intel wanted, rather than the ah/al register-renaming optimization giving 1.4 seconds, normal ah usage taking 2.7 seconds, and register-coalescing abuse taking 3.4 seconds, they could have skipped register renaming and all of those tests would have been the exact same 2.7 seconds. But Intel is smart: they know there's code out there that wants to use ah and al a lot, while it's not common to find code that uses al and ah a lot and also reads the full eax all the time.
Overall, even in the case of no partial register stall, writes to ah are still much slower than writes to eax, which is what I was trying to get across.
Of course, results may vary. Other processors (most likely very old ones) might have control bits to shut off half of the bus, which would allow it to act like a 16-bit or 8-bit bus when it needs to. Those control bits would have to be wired in via logic gates along the input to the registers, which would slightly slow down any and all usage of the register, since that's now one more gate to go through before the register can update. Since such control bits would be off the vast majority of the time (it's rare to mess with 8-bit/16-bit values), it looks like Intel decided not to do that (for good reason).