
I started to learn assembler, and this does not look logical to me.

Why can't I use multiple higher bytes in a register?

I understand the historical reason of rax->eax->ax, so let's focus on new 64-bit registers. For example, I can use r8 and r8d, but why not r8dl and r8dh? The same goes with r8w and r8b.

My initial thinking was that I can use 8 r8b registers at the same time (like I can do with al and ah at the same time). But I can't. And using r8b makes the complete r8 register "busy".

Which raises the question - why? Why would you need to use only a part of a register if you can't use other parts at the same time? Why not just keep only r8 and forget about the lower parts?

Remy Lebeau
nikitablack
  • How do you write a single byte into memory with "only `r8`"? Besides, `r8b` does not make the complete `r8` "busy"; the upper 56 bits are still present, not sure what makes you think otherwise. They are just not directly accessible as a single 8-bit register, nothing else. As for why there are no register aliases for the higher bits: ever wondered how instructions are encoded into machine code? Now add enough bits to encode all the new variations, and you have something like +1 byte for every instruction = too expensive. Here's something relevant http://dsasmblr.com/accessing-and-modifying-upper-half-of-registers/ – Ped7g Aug 04 '17 at 07:26
  • I thought `mov BYTE PTR result, r8` writes a single byte, doesn't it? Another thing: if I use `r8b` I can't access the upper 56 bits at the same time; they exist, but are inaccessible, right? Also the question: why does `ax` have lower and higher byte aliases but `eax` does not? And I have no idea how instructions are encoded into machine code, I have to read up on it. Thank you. – nikitablack Aug 04 '17 at 07:37
  • Possible duplicate of [Why is there not a register that contains the higher bytes of EAX?](https://stackoverflow.com/questions/228200/why-is-there-not-a-register-that-contains-the-higher-bytes-of-eax) – phuclv Aug 04 '17 at 09:36
  • Allowing writes to AH or another partial register incurs a performance cost. That's the reason [why most x64 instructions zero the upper part of a 32 bit register](https://stackoverflow.com/q/11177137/995714) and [why sometimes modern compilers use add instead of inc](https://stackoverflow.com/q/36510095/995714) – phuclv Aug 04 '17 at 09:39
  • I would extend the answers with another point: usually you don't need direct access to the upper bits of registers. If you have two 8-bit variables, you simply use two registers (`al, cl` for example). Using `al, ah` exploits the original 8086 design to its full extent, and it was certainly handy sometimes when creating 256B intros, but for general compilers (and ~95+% of SW is produced by compilers) this is of little value. They need a mechanism to manage a shortage of spare registers anyway, so they can live with registers that are accessible only at certain sizes from the bottom. – Ped7g Aug 04 '17 at 10:50
  • One more thing (tm): `mov BYTE PTR result, r8` ... well, we can discuss the validity of such mnemonics (original Intel syntax would not like this, it's `mov [address],r8b` by Intel, but some smart assemblers may handle yours), but in the end it boils down to the instruction encoding, i.e. which instructions are known to the CPU. The x86/x64 CPU can do either direct or `rip`-relative addressing (destination argument), but the size of the affected memory is not part of this, or of the `mov [mem],r` instruction opcode; it's encoded in the source operand, which means 8 bytes to write when `r8` is used. – Ped7g Aug 04 '17 at 11:02
  • @Ped7g: gcc will sometimes optimize some shifting/masking into `movzx ecx, ah` / `movzx edx, al`. (But note that reading AH has an extra cycle of latency in Skylake, so this is a throughput win but these days not a latency win vs. an extra shift by 8. Especially because `movzx edx, al` has 0 cycle latency on Skylake, handled in register-rename with no execution unit like `mov edx, eax` is.) – Peter Cordes Aug 06 '17 at 07:35

3 Answers


why can't I use multiple higher bytes in a register

Every variant of an instruction needs its own encoding in machine code. The original 8086 processor supports the following options:

instruction     encoding    remarks
---------------------------------------------------------
mov ax,value    b8 01 00    <-- whole register
mov al,value    b0 01       <-- lower byte
mov ah,value    b4 01       <-- upper byte

Because the 8086 is a 16-bit processor, three different versions cover all options.
In the 80386, 32-bit support was added. The designers had a choice: either add support for 3 additional sets of registers (x 8 registers = 24 new registers) and somehow find encodings for these, or leave things mostly as they were before.

Here's what the designers opted for:

instruction     encoding           remarks
---------------------------------------------------------
mov eax,value    b8 01 00 00 00    (same encoding as mov ax,value!)
mov ax,value     66 b8 01 00       (prefix 66 + encoding for mov eax,value)
mov al,value     (same as before)
mov ah,value     (same as before)

They simply added a 0x66 prefix to change the operand size from the (now) default 32 bits to 16 bits, plus a 0x67 prefix to change the address size used by memory operands. And left it at that.
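The encoding scheme described above can be sketched in C. This is only an illustrative model, not assembler source: `encode_mov_imm` and its parameters are hypothetical names, and it covers just the `mov reg, imm` forms (opcodes B0+r for byte registers, B8+r for word/dword registers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode `mov reg, imm`.
 * reg      : 3-bit register number (0 = al/ax/eax; 4 = ah when op_bits is 8)
 * op_bits  : operand size of the instruction: 8, 16 or 32
 * def_bits : default operand size of the current mode: 16 or 32
 * out      : buffer with room for at least 6 bytes
 * Returns the number of bytes written, or 0 for an unsupported request. */
size_t encode_mov_imm(unsigned reg, unsigned op_bits, unsigned def_bits,
                      uint32_t imm, uint8_t *out)
{
    size_t n = 0;
    assert(out != NULL);
    if (reg > 7 || (def_bits != 16 && def_bits != 32))
        return 0;
    if (op_bits == 8) {                 /* byte form: no prefix needed */
        out[n++] = (uint8_t)(0xB0 + reg);
        out[n++] = (uint8_t)imm;
        return n;
    }
    if (op_bits != 16 && op_bits != 32)
        return 0;
    if (op_bits != def_bits)
        out[n++] = 0x66;                /* operand-size override prefix */
    out[n++] = (uint8_t)(0xB8 + reg);
    for (unsigned i = 0; i < op_bits / 8; i++)
        out[n++] = (uint8_t)(imm >> (8 * i));   /* little-endian immediate */
    return n;
}
```

With `def_bits` set to 32, asking for a 16-bit `mov ax, 1` yields the 0x66 prefix followed by the same B8 opcode that `mov eax, 1` uses; the byte forms are unaffected by the mode.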

To do otherwise would have meant doubling the number of instruction encodings or adding several new prefixes for each of your 'new' partial registers.
By the time the 80386 came out nearly all single-byte opcodes were already taken, so there was very little space left for new prefixes. This opcode space had been eaten up by rarely useful instructions like AAA, AAD, AAM, AAS, DAA, DAS, SALC. (These have been disabled in x64 mode to free up much-needed encoding space.)

If you want to put a value into the high byte of a register (clearing the rest of it), simply do:

movzx eax,cl     ; like mov al,cl, but faster (no partial-register write)
shl eax,24       ; move the byte from al to the high byte of eax

But why not two (say r8dl and r8dh)

In the original 8086 there were eight byte-sized (8-bit) registers:

al,cl,dl,bl,ah,ch,dh,bh  <-- in this order.

The index registers, base pointer and stack reg do not have byte registers.

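In x64 this was changed. Without a REX prefix, the eight byte-register encodings still mean al, cl, dl, bl, ah, ch, dh, bh. If a REX prefix is present (it is required to reach the new x64 registers), the encodings that used to mean ah, ch, dh, bh select spl, bpl, sil, dil instead, and one extra encoding bit from the REX prefix extends the set to r8l..r15l, for 16 byte registers in total. So REX adds spl, bpl, sil, dil and r8l..r15l, but excludes every xh register (you can still get the four xh registers when not using a REX prefix).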
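As a rough model of that decoding rule, here is a small C sketch. The function name `byte_reg_name` and its parameters are made up for illustration; real decoders take the field from the opcode or ModRM byte and the extension bit from REX.R or REX.B, depending on the operand.

```c
#include <assert.h>

/* Hypothetical decoder for the name of an 8-bit register operand.
 * field   : the 3-bit register field (0..7)
 * has_rex : nonzero if the instruction carries any REX prefix
 * rex_ext : the REX extension bit that applies to this field
 *           (ignored when has_rex is zero) */
const char *byte_reg_name(unsigned field, int has_rex, int rex_ext)
{
    /* Without REX: the original 8086 order, including ah/ch/dh/bh. */
    static const char *const legacy[8] =
        { "al", "cl", "dl", "bl", "ah", "ch", "dh", "bh" };
    /* With REX, extension bit clear: 4..7 become spl/bpl/sil/dil. */
    static const char *const rex_low[8] =
        { "al", "cl", "dl", "bl", "spl", "bpl", "sil", "dil" };
    /* With REX, extension bit set: r8b..r15b (also written r8l..r15l). */
    static const char *const rex_high[8] =
        { "r8b", "r9b", "r10b", "r11b", "r12b", "r13b", "r14b", "r15b" };

    assert(field < 8);
    if (!has_rex)
        return legacy[field];
    return rex_ext ? rex_high[field] : rex_low[field];
}
```

The same field value 4 therefore names `ah` without a REX prefix and `spl` with one, which is why an instruction that needs REX cannot also address `ah`, `ch`, `dh` or `bh`.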

And using r8b makes the complete r8 "busy"

Yes, this is called a 'partial register write'. Because writing r8b changes part, but not all, of r8, r8 is now split into two parts: one has changed and one has not. The CPU needs to join the two parts. It can do this either by spending an extra cycle on the merge, or by adding more circuitry so that it can be done in a single cycle.
The latter is expensive in terms of silicon and complex in terms of design; it also adds heat because of the extra work being done (more work per cycle = more heat produced). See Why doesn't GCC use partial registers? for a run-down on how different x86 CPUs handle partial-register writes (and later reads of the full register).

if I use r8b I can't access upper 56 bits at the same time, they exist, but unaccessible

No, they are not inaccessible.

mov  rax,bignumber         ; some value in rax
mov  al,0                  ; clear al
xor  r8d,r8d               ; r8=0
mov  r8b,16                ; set r8b
or   r8,rax                ; change the upper 56 bits of r8 without changing r8b

You use masks together with and, or, xor and not to change parts of a register without affecting the rest of it.
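The masking idea can be written out in C, treating a `uint64_t` as a stand-in for a 64-bit register. This is a sketch for illustration only; `byte_mask` and `merge_masked` are hypothetical helper names, not anything the CPU or an assembler provides.

```c
#include <assert.h>
#include <stdint.h>

/* Mask selecting byte number `index` of a 64-bit value (0 = lowest byte). */
uint64_t byte_mask(unsigned index)
{
    assert(index < 8);
    return (uint64_t)0xFF << (8 * index);
}

/* Take the bits selected by `mask` from `src` and every other bit from
 * `reg`: the and / not / or combination applied to whole registers. */
uint64_t merge_masked(uint64_t reg, uint64_t src, uint64_t mask)
{
    return (reg & ~mask) | (src & mask);
}
```

For example, `merge_masked(rax, 16, byte_mask(0))` keeps the upper 56 bits of `rax` and puts 16 in the low byte, which is the same result the assembly sequence above leaves in r8.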

There really was never a need for ah, but it did lead to more compact code on 8086 (and effectively more usable registers). It's still sometimes useful to write EAX or RAX and then read AL and AH separately (e.g. movzx ecx, al / movzx edx, ah) as part of unpacking bytes.

Peter Cordes
Johan
  • Please elaborate on why 14nm CPUs are special in that case. And why don't you take the shuf* and pins* instructions into account, as an example of how it's done in the SIMD subset? – Netch Aug 06 '17 at 06:28
  • Actually, on Intel Skylake `mov al, 123` has a dependency on the previous value of `rax`. R8b is not renamed separately from the rest of R8. I suspect that it's been this way since IvyBridge, when Agner Fog said that there were no more merging uops for using low-8 registers. `mov al, 123` has a throughput of 1 per clock unless you include a dep-breaking instruction. Intel does rename AH separately from the rest of RAX, but strangely `mov ah, 123` or `setne ah` still bottleneck at 1 per clock, while `mov ah, bl` can run 4 per clock. (These are all still independent of `inc al`, though.) – Peter Cordes Aug 06 '17 at 07:31
  • I wrote up my [partial-register experiments for Haswell/Skylake in a Q&A](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to). – Peter Cordes Oct 19 '17 at 19:57
  • Ok, I have a question about this, since I've just started with 64-bit assembly and still don't know all of the calling conventions. So, if I take register R8 and want to change just the lower bits of `R8D`, for example, can I still refer to them as `R8BL` and `R8BH`, like we did with AX, BX, CX, and DX? – mutantkeyboard Feb 20 '20 at 10:25

The general answer is that such access is costly in a few senses and rarely needed.

Since at least the second half of the 1980s, and increasingly since the 1990s, instruction sets have been designed mainly for compiler convenience rather than human convenience. Compiler logic is much simpler when it maps a set of variables with defined sizes (8, 16, 32, 64 bits) onto a fixed set of registers, with each register holding exactly one value at a time. Register overlap complicates that model considerably. As a result, a compiler internally knows a single register "A" (or even R0) that is AL, AX, EAX or RAX, depending on operand size. To use AH, it would have to take into account that AX consists of AH and AL, which is outside its usual view. Even if it generates instructions that involve AH (e.g. LAHF), internally this is likely treated as "an operation that fills A with LowFlags*256". (In reality, there are some hacks that blur this clean picture, but they are very local.)

This combines with other compiler specifics. For example, GCC and Clang are deeply SSA-based. As a result, you will practically never see an XCHG between two registers in their output; if you find one somewhere in code, it is almost certainly a hand-written assembly insertion. The same goes for RCL and RCR, even though they suit some specific cases (e.g. dividing a uint32 by 7), and likely for ROL and ROR. If AMD had dropped RCL and RCR from their x86-64 design, nobody would really have mourned these instructions.

This does not include the vector facility, which is modelled on different principles and is orthogonal to the main one. When a compiler decides to do 4 parallel uint32 operations on an XMM register, it can use PINSR* instructions to replace a part of such a register or PEXTR* to extract it, but in that case it tracks 2, 4, 8, 16... values at a time. Such vectorization doesn't apply to the main register set, at least in the main state-of-the-art ISAs.

This trend in compilers has been matched by an ongoing and strengthening trend in hardware. It's easier to make 16-32 independent architectural registers and track them individually (see register renaming), e.g. an add with 2 register sources and 1 register result, than to track each part of a register separately and treat the same add as an instruction that takes 16 single-byte sources and generates 8 single-byte results. (That's why x86-64 is designed so that a 32-bit register write clears the upper 32 bits of the 64-bit register; this is not done for 8- and 16-bit operations, because the CPU already has to combine the result with the upper bits of the previous register value, for legacy reasons.)
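The write rules mentioned in that parenthesis can be modelled in C, again using `uint64_t` as a stand-in for a 64-bit register. The function names are hypothetical; the sketch only shows the architectural result of each kind of write, not how any particular CPU implements it.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit write (e.g. mov eax, v): the upper 32 bits are cleared, so the
 * old value of the register is not needed at all. */
uint64_t write32(uint64_t old, uint32_t v)
{
    (void)old;
    return (uint64_t)v;
}

/* 16-bit write (e.g. mov ax, v): the upper 48 bits of the old value are
 * kept, so the result depends on the old register contents. */
uint64_t write16(uint64_t old, uint16_t v)
{
    return (old & ~(uint64_t)0xFFFF) | v;
}

/* Low 8-bit write (e.g. mov al, v): the upper 56 bits are kept. */
uint64_t write8_low(uint64_t old, uint8_t v)
{
    return (old & ~(uint64_t)0xFF) | v;
}

/* Small self-check documenting the expected results. */
void demo_write_rules(void)
{
    uint64_t r = 0xFFFFFFFFFFFFFFFFULL;
    assert(write32(r, 1) == 0x0000000000000001ULL);
    assert(write16(r, 1) == 0xFFFFFFFFFFFF0001ULL);
    assert(write8_low(r, 1) == 0xFFFFFFFFFFFFFF01ULL);
}
```

Only `write32` ignores its `old` argument, which is the property that lets hardware treat a 32-bit write as producing a completely new register value.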

There is some chance that this will change in the future, short of a radical revolution in CPU design, but I consider it minimal.

If you currently need access to part of a register, e.g. bits 40-47 of RAX, this can be implemented quite easily with copies, shifts and rotations. To extract it:

MOV RCX, RAX ; expect result in CL
SHR RCX, 40
MOVZX RCX, CL ; to clear all bits except 7-0

To replace value:

ROR RAX, 40
MOV AL, CL ; provided that CL is what to insert
ROL RAX, 40

These code sequences are straight-line and fast enough.
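For readers who want to check the two sequences, here is a C model of them, with `uint64_t` standing in for RAX/RCX. The helper names are hypothetical, and the rotate helpers are written out by hand because standard C has no rotate operator.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right / left by n bits (n is reduced modulo 64). */
uint64_t ror64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x >> n) | (x << (64 - n)) : x;
}

uint64_t rol64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* MOV RCX, RAX / SHR RCX, 40 / MOVZX RCX, CL */
uint8_t extract_bits_40_47(uint64_t rax)
{
    return (uint8_t)(rax >> 40);
}

/* ROR RAX, 40 / MOV AL, CL / ROL RAX, 40 */
uint64_t replace_bits_40_47(uint64_t rax, uint8_t cl)
{
    rax = ror64(rax, 40);                   /* bring bits 40-47 down to AL */
    rax = (rax & ~(uint64_t)0xFF) | cl;     /* overwrite the low byte      */
    return rol64(rax, 40);                  /* rotate everything back      */
}

/* Small self-check documenting the expected results. */
void demo_bits_40_47(void)
{
    uint64_t rax = 0x1122334455667788ULL;
    assert(extract_bits_40_47(rax) == 0x33);
    assert(replace_bits_40_47(rax, 0xAB) == 0x1122AB4455667788ULL);
}
```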

Netch
  • On some Intel CPUs, `movzx` between two separate registers can run with zero latency and no execution port. So ideally you'd use a 3rd register and `MOVZX ECX, DL`. (There's never a reason to use 64-bit operand-size with MOVZX; writing ECX already zero-extends into RCX without needing a REX prefix.) Also, on some CPUs (like Intel Nehalem and earlier), `mov al,cl` will cause a partial-register stall when ROL reads RAX. Shifting that byte of RCX into place and using an `AND RAX, mask / OR RAX, RCX` avoids that, and shortens the dep chain involving RAX from 3 cycles to 2. – Peter Cordes Aug 09 '17 at 11:25
  • Your ROR / 8-bit-mov / ROL sequence is compact, and fast on AMD, and Intel IvyBridge and later, though. – Peter Cordes Aug 09 '17 at 11:26
  • With BMI2, there's also a copy-and-rotate (by immediate) instruction: `rorx rdx, rax, 8` / `movzx ecx, dl`. (Note that `movzx rcx, cl` is a waste of a REX prefix. Let [implicit zero-extension from writing `ecx`](https://stackoverflow.com/questions/11177137/why-do-x64-instructions-zero-the-upper-part-of-a-32-bit-register) do its job.) – Peter Cordes Dec 04 '17 at 07:40

There is another step in the history, the 8-bit 8080 that came before the 8086. Despite it being an 8-bit processor, you could use pairs of 8-bit registers to perform some 16-bit operations.

https://en.wikipedia.org/wiki/Intel_8080#Registers

So to make it easier to convert 8080 assembly code to 8086 code - which seemed important at the time (Intel even supplied a program to do that automatically, almost) - the new 16-bit registers were designed to optionally be used as pairs of 8-bit registers.

However, in the 8086 there were no features to use pairs of 16-bit registers for 32-bit operations, so when the 386 came around there didn't seem to be a need for splitting 32-bit registers into two 16-bit registers.

As Johan shows, the instruction set still provides a way to get two 8-bit registers from the lowest 16-bits. But this (mis)feature was not extended to higher widths.

Likewise, when moving to 64 bits there is no precedent of using pairs of 32-bit registers for 64-bit operations (except for some odd double shifts). And nobody tries to convert old assembly code anymore. It never worked that well anyway.

Bo Persson
  • See also [Why are first four x86 GPRs named in such unintuitive order?](https://retrocomputing.stackexchange.com/questions/5121/why-are-first-four-x86-gprs-named-in-such-unintuitive-order) for more about how 8086 register pairs evolved from 8080. – Peter Cordes Dec 04 '17 at 07:42