I am trying to figure out whether it is possible to run a Linux VM whose RAM is only backed by a single physical page.

To simulate this, I modified the nested page fault handler in KVM to remove the present bit from all nested page table (NPT) entries, except the one corresponding to the currently processed page fault.

While trying to start a Linux guest, I observed that assembly instructions that use memory operands, like

add [rbp+0x820DDA], ebp

lead to a page fault loop until I restore the present bit for the page containing the instruction as well as for the page referenced in the operand (in this example [rbp+0x820DDA]).

I am wondering why this is the case. Shouldn't the CPU access the memory pages sequentially, i.e. first read the instruction and then access the memory operand? Or does x86 require that the instruction page as well as all operand pages are accessible at the same time?

I am testing on AMD Zen 1.

savvybug

  • Why would you want to do this? – S.S. Anne Apr 01 '20 at 19:21
  • Just out of technical interest :) – savvybug Apr 01 '20 at 19:31
  • This is insane on the level of "boot Linux on a 486 emulator running in JavaScript in the browser". I love it. – chrylis -cautiouslyoptimistic- Apr 02 '20 at 01:32
  • Heh, apparently I took this question to the same logical conclusion you were already thinking, about the minimum working set for guaranteed forward progress. I had already answered that before you added that new first paragraph to the question. :P I added some links and more details in a few spots (e.g. the page-walker is allowed to cache some guest page-directory entries internally) since this question is getting way more attention than I expected thanks to somehow making it to HNQ. – Peter Cordes Apr 02 '20 at 04:26

1 Answer


Yes: to execute an instruction, the CPU requires the page holding its machine code and the pages holding all of its memory operands to be present at the same time.

Shouldn't the CPU access the memory pages sequentially, i.e. first read the instruction and then access the memory operand?

Yes that's logically what happens, but a page-fault exception interrupts that 2-step process and discards any progress. The CPU doesn't have any way to remember what instruction it was in the middle of when a page-fault occurred.

When a page-fault handler returns after handling a valid page fault, RIP = the address of the faulting instruction, so the CPU retries executing it from scratch.

It would be legal for the OS to modify the machine code of the faulting instruction and expect it to execute a different instruction after iret from the page-fault handler (or any other exception or interrupt handler). So AFAIK it's architecturally required that the CPU redoes code-fetch from CS:RIP in the case you're talking about. (Assuming it even does return to the faulting CS:RIP instead of scheduling another process while waiting for disk on hard page fault, or delivering a SIGSEGV to a signal handler on an invalid page fault.)

It's probably also architecturally required for hypervisor entry/exit. And even if it's not explicitly forbidden on paper, it's not how CPUs work.

@torek comments that some (CISC) microprocessors partially decode instructions and dump microregister state on a page fault, but x86 is not like that.


A few instructions are interruptible and can make partial progress, like rep movs (memcpy in a can) and other string instructions, or gather loads/scatter stores. But the only mechanism is updating architectural registers like RCX / RSI / RDI for string ops, or the destination and mask registers for gathers (e.g. manual for AVX2 vpgatherdd). Not keeping the opcode / decode results in some hidden internal register and restarting it after iret from a page fault handler. These are instructions that do multiple separate data accesses.

Also keep in mind that x86 (like most ISAs) guarantees that instructions are atomic wrt. interrupts / exceptions: they either fully happen, or don't happen at all, before an interrupt (see the related Q&A "Interrupting an assembly instruction while it is operating"). So for example add [mem], reg would be required to discard the load if the store part faulted, even without a lock prefix.


The worst-case number of guest user-space pages that must be present to make forward progress might be 6 (plus separate guest-kernel page-table subtrees for each one):

  • a movsq or movsw: a 2-byte instruction placed spanning a page boundary, so both pages are needed for it to decode (2 pages)
  • its qword source operand [rsi], also split across a page boundary (2 pages)
  • its qword destination operand [rdi], also split across a page boundary (2 pages)

If any of these 6 pages fault, we're back to square one.

rep movsd is also a 2-byte instruction, and making progress on one step of it would have the same requirement. Similar cases like push [mem] or pop [mem] could be constructed with a misaligned stack.

One reason (or side benefit) for making gather loads / scatter stores "interruptible" (updating the mask vector with their progress) is to avoid inflating this minimum footprint for executing a single instruction. It also makes handling multiple faults during one gather or scatter more efficient.


@Brendan points out in comments that a guest will also need its page tables in memory, and that the user-space splits can also be across 1 GiB boundaries, so the two halves fall in different sub-trees below the top-level PML4. The hardware page walk will need to touch all of these guest page-table pages to make progress. A situation this pathological is unlikely to happen by chance.

The TLB (and page-walker internals) are allowed to cache some of the page-table data, and aren't required to restart the page walk from scratch unless the OS did invlpg or loaded a new CR3 top-level page directory. Neither of these is necessary when changing a page from not-present to present; x86 on paper guarantees that it's not needed (so "negative caching" of not-present PTEs isn't allowed, at least not visibly to software). So the CPU might not VMexit even if some of the guest-physical page-table pages are not actually present.

PMU performance counters can be enabled and configured such that executing the instruction also requires writing a perf-event sample into a PEBS buffer. With a counter's mask configured to count only user-space instructions, not kernel, it could well keep trying to overflow the counter and store a sample in the buffer every time you return to user space, producing a page fault each time.

Peter Cordes
  • Thanks for the very detailed answer! – savvybug Apr 01 '20 at 10:54
  • Worst case for a single instruction might be something like `push dword [foo]` (or even just `call [foo]`) with everything misaligned across a page-directory-pointer-table boundary (adding up to 6 pages, 6 page tables, 6 page directories, 6 PDPTs and one PML4); with the CPU's precise event-based sampling (PEBS) feature enabled and configured so that the `push` causes performance-monitoring data to be added to the PEBS buffer. For a conservative "minimum pages provided by host so guest can make progress in pathological cases" I'd want at least 16 pages. – Brendan Apr 01 '20 at 11:00
  • Note that this sort of thing has always been common in CISC-y architectures. Some microprocessors partially decode instructions and dump microregister state on a page fault, but others don't and/or require that address operands for "loop-y" instructions (DBRA on m68k, MOVC3/MOVC5 on VAX, etc.) be in registers, similar to your `rep movs` example. – torek Apr 02 '20 at 07:51
  • @Brendan: someone counted a worst case on a VAX instruction as about 50 pages. I forget the details, but you'd obviously put the instruction itself on a page boundary, use something like the translate-table lookup with the table spanning a page boundary, use (rX)[rY] with the indirects at page boundaries, and so on. The hairiest instructions took up to 6 operands (loading them into r0-r5) and all six could be double indirects, I think. – torek Apr 02 '20 at 07:54
  • The OS could change the instruction, but it can also change `EIP`. So there's a logical follow-up question: what's the minimum number of pages needed, assuming an intelligent instruction-patching scheme? E.g. copy the unaligned value to an aligned scratch buffer, emulate the instruction, and IRET to the next instruction. – MSalters Apr 02 '20 at 08:01
  • The page containing the OS's `iret` instruction also needs to be in memory. This is a one-byte instruction, so one extra page. The page-fault handler's interrupt entry point also needs to be in memory, but that can be the same page as above. – Stig Hemmer Apr 02 '20 at 09:29
  • @StigHemmer: If there's no #PF exception, the hypervisor can have the guest kernel's pages not present. A successful page walk doesn't involve the kernel at all, just the page-table data structures. And if we're not talking about a guest under a hypervisor, then normally key parts of kernel memory aren't paged at all, i.e. are pinned. But sure, we could imagine a tiny kernel that fits its code into one 4k page, including the GDT, IDT (interrupt descriptor table) and #PF handler. – Peter Cordes Apr 02 '20 at 09:46
  • I must admit I fell off the x86 train before the hypervisor arrived. I'll take your word for it. – Stig Hemmer Apr 02 '20 at 09:54
  • @StigHemmer: I think the question reduces to how many pages have to be present for an instruction to make progress *without* page-faulting, so not counting any (guest-)kernel stuff other than page-table data. You do raise a good point about the actual practical footprint of (guest-)physical pages including the kernel, though. Guest physical memory is virtualized by a hypervisor, with HW support from nested page tables. I don't usually worry about that; I run Linux on bare metal on my desktop because hypervisors don't expose HW perf counters well / at all. – Peter Cordes Apr 02 '20 at 10:03