
I can move data items stored in memory to a general-purpose register of my choosing using the MOV instruction.

MOV r8, [m8]
MOV r16, [m16]
MOV r32, [m32]
MOV r64, [m64]

Now, don’t shoot me, but how is the following achieved: MOV r24, [m24]? (I appreciate that this is not a legal instruction.)

In my example, I want to move the characters “Pip”, i.e. 0x706950, to register rax.

section .data           ; Section containing initialized data

DogsName: db "PippaChips"
DogsNameLen: equ $-DogsName

I first considered that I could move the bytes separately, i.e. first a byte, then a word, or some combination thereof. However, I cannot reference the ‘top halves’ of eax or rax, so this falls down at the first hurdle, as I would end up overwriting whatever data was moved first.

My solution:

mov al, byte [DogsName + 2] ; move the character “p” to register al
shl rax, 16                 ; shift left by 16 bits, clearing ax to receive the characters “Pi”
mov ax, word [DogsName]     ; move the characters “Pi” to register ax

I could just declare “Pip” as an initialized data item, but the example is just that, an example; I want to understand how to reference 24 bits in assembly (or 40, 48… for that matter).

Is there an instruction more akin to MOV r24, [m24]? Is there a way to select a range of memory addresses, as opposed to providing the offset and specifying a size operator? In short: how do I move 3 bytes from memory to a register in x86_64 assembly?

NASM version 2.11.08. Architecture x86.

Andrew Hardiman
  • For best performance you have your memory structures designed in such a way that you can read 4 bytes instead (without crashing on an invalid address). Then you do `mov eax,[m24]` and `and eax,0x00FFFFFF` (if you need bits 24-63 of `rax` cleared). Often an algorithm doesn't even care about the 4th and higher bytes of `rax`, so you just keep calculating with the 3 bytes as desired and ignore the junk in the upper bits of `rax`. Overall, computers usually don't like 24-bit sizes, especially the x86 family. Even "true color" graphics modes are usually 32-bit (with 8 bits of padding wasted) rather than 24-bit, for better performance. – Ped7g Dec 15 '17 at 12:29
  • For cases where memory compactness is important and the data can't be padded to 4/8/... boundaries, your approach of composing the value with a byte + word load is correct. – Ped7g Dec 15 '17 at 12:32

3 Answers


If you know the 3-byte int isn't at the end of a page, normally you'd do a 4-byte load and mask off the high garbage that came with the bytes you wanted, or simply ignore it if you're doing something with the data that doesn't care about high bits. Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?


Unlike stores¹, loading data that you "shouldn't" is never a problem for correctness unless you cross into an unmapped page. (E.g. if db "pip" came at the end of a page, and the following page was unmapped.) But in this case, you know it's part of a longer string, so the only possible downside is performance if a wide load extends into the next cache line (so the load crosses a cache-line boundary). Is it safe to read past the end of a buffer within the same page on x86 and x64?

Either the byte before or the byte after will always be safe to access, for any 3 bytes (without even crossing a cache-line boundary if the 3 bytes themselves weren't split between two cache lines). Figuring this out at run-time is probably not worth it, but if you know the alignment at compile time, you can do either

mov   eax, [DogsName-1]     ; if previous byte is in the same page/cache line
shr   eax, 8

mov   eax, [DogsName]       ; if following byte is in the same page/cache line
and   eax, 0x00FFFFFF

I'm assuming you want to zero-extend the result into eax/rax, like 32-bit operand-size, instead of merging with the existing high byte(s) of EAX/RAX like 8 or 16-bit operand-size register writes. If you do want to merge, mask the old value and OR. Or if you loaded from [DogsName-1] so the bytes you want are in the top 3 positions of EAX, and you want to merge into ECX: shr ecx, 24 / shld ecx, eax, 24 to shift the old top byte down to the bottom, then shift it back while shifting in the 3 new bytes. (There's no memory-source form of shld, unfortunately. Semi-related: efficiently loading from two separate dwords into a qword.) shld is fast on Intel CPUs (especially Sandybridge and later: 1 uop), but not on AMD (http://agner.org/optimize/).
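A minimal sketch of that merge (my own illustration; it assumes, as above, that the byte at DogsName-1 is safe to read and that ECX's old top byte must be preserved):

mov   eax, [DogsName-1]    ; the 3 wanted bytes land in the top 3 byte positions of EAX
shr   ecx, 24              ; move ECX's old top byte down to the bottom
shld  ecx, eax, 24         ; shift it back up while shifting in the 3 new bytes from EAX
; ECX now holds the old top byte in bits 24-31 and the 3 loaded bytes in bits 0-23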


Combining 2 separate loads

There are many ways to do this, but there's no single fastest way across all CPUs, unfortunately. Partial-register writes behave differently on different CPUs. Your way (byte load / shift / word load into ax) is fairly good on CPUs other than Core2/Nehalem (which will stall to insert a merging uop when you read eax after assembling it). But start with movzx eax, byte [DogsName + 2] to break the dependency on the old value of rax.

The classic "safe everywhere" code that you'd expect a compiler to generate would be:

DEFAULT REL      ; compilers use RIP-relative addressing for static data; you should too.
movzx   eax, byte [DogsName + 2]   ; avoid false dependency on old EAX
movzx   ecx, word [DogsName]
shl     eax, 16
or      eax, ecx

This takes an extra instruction, but avoids writing any partial registers. However, on CPUs other than Core2 or Nehalem, the best option for 2 loads is writing ax. (Intel P6 before Core2 can't run x86-64 code, and CPUs without partial-register renaming will merge into rax when writing ax.) Sandybridge still renames AX, but the merge only costs 1 uop with no stalling, i.e. the same cost as the OR; on Core2/Nehalem, though, the front-end stalls for about 3 cycles while inserting the merge uop.

Ivybridge and later only rename AH, not AX or AL, so on those CPUs the load into AX is a micro-fused load+merge. Agner Fog doesn't list an extra penalty for mov r16, m on Silvermont or Ryzen (or on any of the other tabs in the spreadsheet I looked at), so presumably other CPUs without partial-reg renaming also execute mov ax, [mem] as a load+merge.

movzx   eax, byte [DogsName + 2]
shl     eax, 16
mov      ax, word [DogsName]

; when read eax:
  ; * Sandybridge: extra 1 uop inserted to merge
  ; * core2 / nehalem: ~3 cycle stall (unless you don't use it until after the load retires)
  ; * everything else (including IvB+): no penalty, merge already done

Actually, testing alignment at run-time can be done efficiently. Given a pointer in a register, the previous byte is in the same cache line unless the low 5 or 6 bits of the address are all zero (i.e. the address is aligned to the start of a cache line). Let's assume cache lines are 64 bytes; all current CPUs use that, and I don't think any x86-64 CPUs with 32-byte lines exist. (And we still definitely avoid page-crossing.)

    ; pointer to m24 in RSI
    ; result: EAX = zero_extend(m24)

    test   sil, 111111b     ; test the low 6 bits.  There's no TEST r32, imm8, so REX + test r8, imm8 is shorter and never slower.
    jz   .aligned_by_64

    mov    eax, [rsi-1]
    shr    eax, 8
.loaded:

    ...
    ret    ; end of whatever large function this is part of

 ; unlikely block placed out-of-line to keep the common case fast
.aligned_by_64:
    mov    eax, [rsi]
    and    eax, 0x00FFFFFF
    jmp   .loaded

So in the common case, the extra cost is only one not-taken test-and-branch uop.

Depending on the CPU, the inputs, and the surrounding code, testing the low 12 bits (to only avoid crossing 4k boundaries) would trade better branch prediction for some cache-line splits within pages, but still never a page split. (In that case, test esi, (1<<12)-1. Unlike testing sil with an imm8, testing si with an imm16 is not worth the LCP stall on Intel CPUs to save 1 byte of code. And of course, if you can have your pointer in rax/rbx/rcx/rdx, you don't need a REX prefix, and there's even a compact 2-byte encoding for test al, imm8.)
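A sketch of that 4k-only variant (my own, mirroring the block above; the pointer is again assumed to be in RSI, and the label names are mine):

    test   esi, (1<<12)-1   ; low 12 bits all zero => RSI is at the start of a 4K page
    jz     .page_aligned    ; then [rsi-1] would touch the previous page

    mov    eax, [rsi-1]     ; may split a cache line, but never a page
    shr    eax, 8
.got_it:

    ...

.page_aligned:
    mov    eax, [rsi]
    and    eax, 0x00FFFFFF
    jmp    .got_it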

You could even do this branchlessly, but clearly not worth it vs. just doing 2 separate loads!

    ; pointer to m24 in RSI
    ; result: EAX = zero_extend(m24)

    xor    ecx, ecx
    test   sil, 7         ; might as well keep it within a qword if  we're not branching
    setnz  cl             ; ecx = (rsi not qword-aligned) ? 1 : 0

    sub    rsi, rcx       ; normally rsi-1
    mov    eax, [rsi]

    shl    ecx, 3         ; cl = 8 or 0
    shr    eax, cl        ; eax >>= 8  or  eax >>= 0

                          ; with BMI2:  shrx eax, [rsi], ecx  is more efficient

    and    eax, 0x00FFFFFF  ; mask off to handle the case where we didn't shift.

True architectural 24-bit load or store

Architecturally, x86 has no 24-bit loads or stores with an integer register as the destination or source. As Brendan points out, MMX / SSE masked stores (like MASKMOVDQU, not to be confused with pmovmskb eax, xmm0) can store 24 bits from an MMX or XMM reg, given a vector mask with only the low 3 bytes set. But they're almost never useful because they're slow and always have an NT hint (so they write around the cache, and force eviction like movntdq). (The AVX dword/qword masked load/store instructions don't imply NT, but aren't available with byte granularity.)
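For reference only, a rough sketch of what that maskmovdqu store could look like (my own example, not recommended in practice for the NT reasons above; bytemask3 is a name I made up, MASKMOVDQU stores to the address in RDI implicitly, and it selects bytes by the high bit of each mask byte):

section .data
align 16
bytemask3: db 0x80, 0x80, 0x80, 0,0,0,0,0, 0,0,0,0,0,0,0,0   ; select only the low 3 bytes

section .text
; the 3 bytes to store are assumed to be in the low bytes of XMM0, destination address in RDI
movdqa      xmm1, [rel bytemask3]
maskmovdqu  xmm0, xmm1        ; NT byte-masked store of the selected bytes of XMM0 to [rdi]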

AVX512BW (Skylake-server) adds vmovdqu8, which gives you byte-masking for loads and stores, with fault-suppression for bytes that are masked off. (I.e. you won't segfault if the 16-byte load includes bytes in an unmapped page, as long as the mask bit isn't set for those bytes. But that does cause a big slowdown.) So microarchitecturally it's still a 16-byte load, but the effect on architectural state (i.e. everything except performance) is exactly that of a true 3-byte load/store (with the right mask).

You can use it on XMM, YMM, or ZMM registers.

;; probably slower than the integer way, especially if you don't actually want the result in a vector
mov       eax, 7                  ; low 3 bits set
kmovw     k1, eax                 ; hoist the mask setup out of a loop


; load:  leave out the {z} to merge into the old xmm0 (or ymm0 / zmm0)
vmovdqu8  xmm0{k1}{z}, [rsi]    ; {z}ero-masked 16-byte load into xmm0 (with fault-suppression)
vmovd     eax, xmm0

; store
vmovd     xmm0, eax
vmovdqu8  [rsi]{k1}, xmm0       ; merge-masked 16-byte store (with fault-suppression)

This assembles with NASM 2.13.01. IDK if your NASM is new enough to support AVX512. You can play with AVX512 without hardware using Intel's Software Development Emulator (SDE).

This looks cool because it's only 2 uops to get a result into eax (once the mask is set up). (However, http://instlatx64.atw.hu/'s spreadsheet of data from IACA for Skylake-X doesn't include vmovdqu8 with a mask, only the unmasked forms. Those do indicate that it's still a single-uop load, or a micro-fused store, just like a regular vmovdqu/a.)

But beware of slowdowns if a 16-byte load would have faulted or crossed a cache-line boundary. I think it internally does do the load and then discards the bytes, with a potentially-expensive special case if a fault needs to be suppressed.

Also, for the store version, beware that masked stores don't forward as efficiently to loads. (See Intel's optimization manual for more).


Footnotes:

  1. Wide stores are a problem because even if you replace the old value, you'd be doing a non-atomic read-modify-write, which could break things if that byte you put back was a lock, for example. Don't store outside of objects unless you know what comes next and that it's safe, e.g. padding that you put there to allow this. You could lock cmpxchg a modified 4-byte value into place, to make sure you're not stepping on another thread's update of the extra byte, but obviously doing 2 separate stores is much better for performance than an atomic cmpxchg retry loop.
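As a hedged sketch (my own, not from the answer) of what such a cmpxchg retry loop could look like, storing 3 new bytes to [rdi] without disturbing a concurrent writer of the 4th byte; the register choices (new value in ESI with its top byte zero, pointer in RDI) are assumptions:

    mov     eax, [rdi]            ; expected old dword
.retry:
    mov     ecx, eax
    and     ecx, 0xFF000000       ; keep the unrelated 4th byte
    or      ecx, esi              ; splice in the 3 new bytes (ESI bits 24-31 assumed zero)
    lock cmpxchg [rdi], ecx       ; if [rdi] still equals EAX, store ECX; else EAX reloads from [rdi]
    jnz     .retry                ; retry with the freshly-loaded value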
Peter Cordes
  • Wow, this is an amazing answer, thank you so much. I'm going to need some time to digest! – Andrew Hardiman Dec 15 '17 at 15:52
  • @case_2501: thanks for the interesting question. Simple but with plenty of interesting optimization and CPU-architecture (and silly CPU tricks) stuff to talk about, especially because of microarchitectural differences in partial-register behaviour on different CPUs. That's what made it interesting to start thinking about in the first place. – Peter Cordes Dec 15 '17 at 15:55
  • And BTW, some / most of it is hopefully understandable for anyone with basic asm knowledge, but much of the interesting (to me) stuff to talk about here is subtle / advanced micro-arch optimization stuff. I didn't leave it out just because the question didn't mention performance :) I think I wrote a whole paragraph about saving 2 bytes :P (When all else is equal, like it is for `test sil, imm8` vs. `test esi, imm32`, smaller is better) – Peter Cordes Dec 15 '17 at 16:04
  • no, it's great, thank you. I'm trying to digest as much as possible. It's always useful to have further reading/resources; I'm currently about a third of the way through Jeff Duntemann, and it's indispensable to be able to elaborate on and think through specific elements in more detail. – Andrew Hardiman Dec 15 '17 at 16:28
  • @case_2501: cool. I think learning asm in the first place is all about getting into the details of how things work, so many people interested in that are probably also interested in further details :) And BTW, there are lots of good links to official docs and good guides in [the x86 tag wiki](https://stackoverflow.com/tags/x86/info) – Peter Cordes Dec 15 '17 at 16:38

The only way to write exactly 24 bits is to use MMX (MASKMOVQ) or SSE (MASKMOVDQU) and a mask to prevent the bytes you don't want modified from being modified. However, for a single write, MMX and SSE are excessively complicated (and likely slower).

Note that normally reads are cheaper than writes (especially when multiple CPUs are involved). With this in mind, an alternative would be:

    shl eax,8              ; make room: the 3 value bytes move up to bits 8..31
    mov al,[DogsName+3]    ; read the old 4th byte into the low byte
    ror eax,8              ; rotate: value back to bits 0..23, old byte now in bits 24..31
    mov [DogsName],eax     ; 4-byte store; the 4th byte is rewritten with its old value

This overwrites the byte after with its old value (and may cause problems if the byte after is inaccessible, or if it belongs to anything that needs to be updated atomically).

Brendan
  • Beware that `MASKMOVQ` and `MASKMOVDQU` perform *stores*. – Margaret Bloom Dec 15 '17 at 13:19
  • Great point with `maskmov` for writes; that does make it architecturally possible to truly do a 24-bit store without any non-atomic read / modify / write-back, and with fault-suppression. But they are implicitly NT stores, so are unusably slow in normal cases. But AVX512BW can do byte-masked loads/stores with `vmovdqu8 xmm0{k1}{z}, [rsi]` for example. – Peter Cordes Dec 15 '17 at 13:21
  • `eax` may contain data in the highest byte. Better clear it first. – Jongware Dec 15 '17 at 14:40
  • @usr2564301: This snippet is storing the low 3 bytes of the initial `eax` to `[DogsName+0..2]`, with a non-atomic read/rewrite of `[DogsName+3]`. This is the opposite of what the question asked for. – Peter Cordes Dec 15 '17 at 15:41
  • I don't think merge-load / store is likely to be better than 2 stores, unless maybe you had already loaded something else from that cache line. This way makes the store data dependent on the load, but just doing 2 stores back to back would put them in the store buffer where they can eventually commit to L1D whenever the CPU eventually gets its hands on the cache line. Loads are cheaper than stores in general, but extra dependencies on lines you didn't need to load from are not great. – Peter Cordes Dec 15 '17 at 15:45
  • And two *adjacent* stores (especially into the same cache line) is only bad if you bottleneck on store uops (not memory bandwidth, just store instructions, i.e. 1 store per clock). – Peter Cordes Dec 15 '17 at 15:45

With BMI2 you can use BZHI

BZHI r32a, r/m32, r32b   Zero bits in r/m32 starting with the position in r32b, write result to r32a
BZHI r64a, r/m64, r64b   Zero bits in r/m64 starting with the position in r64b, write result to r64a

So to load the low 24 bits from [mem] you can use

MOV  eax, 24
BZHI eax, [mem], eax

With this you can also load a variable number of bits from memory.
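For instance, a hedged sketch of a variable-width load (the bit count is assumed to be in ECX, in the range 0-31; note it's still a 4-byte load, so the page-crossing caveat from the comment below applies):

; ECX = number of low bits to keep (0..31)
BZHI eax, [mem], ecx    ; load the dword at [mem], zeroing bits ECX and above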

phuclv
  • This will still fault if the high byte of the dword load crosses into the next page. For compile-time-constant bit counts, there's no advantage to using 3 total unfused-domain uops here (`mov`-immediate + micro-fused load+bzhi) vs. `mov eax, [mem]` / `and eax, (1<<24)-1`. In fact the latter is shorter, too, and doesn't require BMI2. Or writing it your way: `mov eax, 0x00ffffff` / `and eax, [mem]`. There's no short encoding for `mov eax, imm8`, so using a narrower constant only helps for immediates to ALU instructions, not `mov`. – Peter Cordes Oct 22 '18 at 07:23