
Intel CPUs have several operating modes: real mode, protected mode, virtual real (virtual 8086) mode, and 64-bit mode.

Out of these modes, which one executes the same set of instructions fastest?

Using prefixes, one can change the address and operand sizes so that they are similar to those of other modes.

Peter Cordes
  • I'd expect 64-bit mode to be faster since Intel notoriously optimizes the most used cases (the "critical path" if you will). For example on Skylake `(i)mul r64` has better latency and uop counts than `(i)mul r32` (though more port 6 pressure). However, while some instructions are mode-specific or change semantics between modes, the real test would be performing a memory access. I don't expect differences (any should be covered within a clock period) but there could be. As a matter of fact, Real Mode was (is?) just a special configuration of Protected Mode so the same logic should be used. – Margaret Bloom Jan 11 '18 at 12:05
  • Trivia: there are other modes like SMM, (the improper) unreal mode, VMX root/non-root, SGX and (once there was) ICE mode. The question of how these modes affect performance is fascinating. – Margaret Bloom Jan 11 '18 at 12:09
  • The scenario is that I want to write a general program. I want to know which mode would be faster and why? – Lakshman Siddardha Jan 11 '18 at 17:43
  • 2
    @LakshmanSiddardha: That's something for the OS developer to worry about. General programs don't get to set their own preferred mode, they get whatever the OS gives them. – Ben Voigt Jan 11 '18 at 20:47
  • @BenVoigt: Most modern x86-64 OSes let user-space choose between long mode and compat mode. – Peter Cordes Jan 11 '18 at 22:25
  • @PeterCordes: Declaratively, yes, via the PE or ELF header data. But not imperatively. – Ben Voigt Jan 11 '18 at 22:37
  • @MargaretBloom - I wouldn't say that Intel optimizes the 64-bit version of instructions over the 32-bit versions at all. Both are very important, and I suspect that for many or most instructions the 32-bit version is much more common than the 64-bit version (indeed, people aren't running around changing `int` to `long` all over the place). `mul` is a special case: the 32-bit version produces a 64-bit result but it needs to be "oddly" split (only odd if you ignore the lineage of x86-64) across `eax` and `edx`, and the extra uop is no doubt doing that splitting. – BeeOnRope Jan 12 '18 at 18:05
  • I.e., there is a 64-bit multiplier which produces 2x 64-bit high and low results, so the output format for the 32->64 `mul` forms is just a little mismatched. In general, this issue doesn't arise, and 32-bit and 64-bit forms of most instructions are equally speedy. There are some exceptions the other way, like `bswap` and `movbe` where the 64-bit version is slower. Plenty of 16-bit sized instructions are slower - but these are uncommon. – BeeOnRope Jan 12 '18 at 18:07
  • @BeeOnRope I'm not saying that Intel is optimizing the 64-bit version of *instructions*, I'm saying that if any differences exist between the *modes* of execution (a priori nothing forbids some instructions from having different latencies in real vs protected vs ia32e/long mode) then it would probably make sense to design optimizing the 64-bit mode version. That said, the `mul` example was wrong. I don't expect differences in the ALU instructions (if any, they should only be in the AGU). – Margaret Bloom Jan 12 '18 at 21:45
  • @MargaretBloom - OK, sure - but yes, the example seemed contradictory since I assumed you were comparing `imul r64` against `imul r32` (for example) in the _same_ mode, but I guess you could have been comparing `imul r64` in long mode and `imul r32` in compat mode? I think it's safe to say that today long mode and compat mode both execute at full performance for essentially every instruction. Any observed differences between compiled binaries are primarily related to different instructions available only in one mode and the cost of 64-bit pointers. One day, perhaps compat mode _will_ be slower. – BeeOnRope Jan 12 '18 at 21:49
  • @BeeOnRope It was a wrong example :) I appreciated your insight very much (Sorry I didn't have the time to write a proper reply yesterday) – Margaret Bloom Jan 13 '18 at 07:47

1 Answer


TL;DR: Tell your compiler to make 64-bit executables to get max performance most of the time. But it can be worth benchmarking against a 32-bit build, especially if your code uses a lot of pointer-heavy data structures.

In theory, faster 64-bit code is almost always possible (and a few legacy realities, like not being able to assume SSE2 for 32-bit, and the 32-bit legacy calling conventions, also favour 64-bit in practice), but sometimes making your program faster in 64-bit mode would involve something like an ILP32 ABI such as Linux x32, or maybe using `int_least32_t` instead of `long` when you want a type that's at least 32 bits.


Intel (and AMD) CPUs don't have any inherent penalties that make decoding or execution less efficient in any mode¹.

But some choices of operand-size are worse than others (e.g. 16-bit sucks because of partial-register false dependencies or stalls), and 16-bit code needs prefixes to use 32-bit operand-size and address-size. Intel CPUs don't have a problem decoding lots of prefixes, but larger code-size in general is a bad thing, reducing code density in L1I cache and sometimes in the uop cache.
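To make the prefix cost concrete, here is a sketch in Python using hand-assembled machine-code bytes for the same register-register ADD (standard x86 encodings: opcode `01 /r` with ModRM `C8`, operand-size prefix `0x66`):

```python
# The 0x66 operand-size prefix toggles between the mode's default operand
# size and the other one. Hand-assembled encodings of the same ADD:
add_eax_ecx = bytes([0x01, 0xC8])        # 32-bit mode: add eax, ecx (2 bytes)
add_ax_cx   = bytes([0x66, 0x01, 0xC8])  # 32-bit mode: add ax, cx needs 0x66

# In 16-bit mode the roles swap: 01 C8 decodes as `add ax, cx`, and the
# 0x66 prefix is what selects the 32-bit `add eax, ecx` form.
print(len(add_eax_ecx), len(add_ax_cx))  # 2 3
```

So 16-bit code that wants 32-bit operand-size (or vice versa) pays one extra byte per instruction, which is exactly the code-density cost described above.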

Footnote 1: except if you're using 32-bit address-size in 16-bit mode, e.g. "big unreal mode". Then Intel P6-family CPUs (i.e. before Sandybridge) will have LCP stalls on every such instruction with a 32-bit ModRM addressing mode in 16-bit mode, even if it's not actually length-changing, i.e. a false LCP stall. Address-size prefixes aren't useful in normal 32-bit mode (except as padding) so this problem is basically not relevant for 32-bit code.


64-bit code has larger instructions (because 64-bit operand size needs a REX prefix). Usually this doesn't matter, because the uop cache and L1I cache usually completely hide the effect of code-size on performance. 32 and 64-bit operand-size are both the same speed for most instructions, and 64-bit code can still use 32-bit operand-size except when it really needs wide types, to avoid the extra cost of 64-bit integer division (and the REX prefixes).
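The REX-prefix size cost can be shown the same way, again with hand-assembled bytes for the same ADD (REX.W is the standard `0x48` prefix byte):

```python
# REX.W (0x48) promotes an instruction to 64-bit operand size, costing a byte:
add_eax_ecx = bytes([0x01, 0xC8])        # add eax, ecx: 32-bit operand size
add_rax_rcx = bytes([0x48, 0x01, 0xC8])  # add rax, rcx: REX.W + same opcode
print(len(add_eax_ecx), len(add_rax_rcx))  # 2 3
```

This is why compilers in 64-bit mode prefer 32-bit operand-size (e.g. `xor eax, eax`, 32-bit `int` arithmetic) whenever the wider type isn't needed: the REX-free encoding is shorter, and writes to a 32-bit register zero-extend into the full 64-bit register anyway.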

The scenario is that I want to write a general program. I want to know which mode would be faster and why?

This is a different question than what you asked.

Long mode is usually fastest because it usually takes fewer instructions to get the same work done, because of better calling conventions and more registers (fewer spills). Especially if you have any FP computation, or SIMD-friendly loops, 64-bit mode can be a big win because FP code can often take advantage of more registers.

But pointer-heavy data structures in 64-bit code have twice the cache footprint of 32-bit code (which can run in protected/compat mode). Also, having a 64-bit alignment requirement can result in more struct padding, so a pointer + int struct will be 16 bytes, not 12 bytes, in 64-bit code.
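You can check the padding effect from Python with `ctypes`, which lays structs out using the host C ABI (the sizes in the comments assume a typical 64-bit ABI; `PtrInt32` is a hypothetical ILP32-style layout built with an explicit 32-bit field):

```python
import ctypes

class PtrInt64(ctypes.Structure):
    # pointer + int with native 64-bit pointers
    _fields_ = [("p", ctypes.c_void_p),  # 8 bytes, 8-byte alignment
                ("i", ctypes.c_int32)]   # 4 bytes + 4 bytes of tail padding

class PtrInt32(ctypes.Structure):
    # the same logical struct if pointers were 32 bits (ILP32-style)
    _fields_ = [("p", ctypes.c_uint32),
                ("i", ctypes.c_int32)]

print(ctypes.sizeof(PtrInt64))  # 16 on a typical 64-bit ABI, not 12
print(ctypes.sizeof(PtrInt32))  # 8
```

The 8-byte alignment requirement of the pointer rounds the 12 bytes of data up to 16, so an array of these structs is twice the size it would be with 32-bit pointers.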

So you can get more cache misses in 64-bit code, and this can make it slower than 32-bit. Linux's x32 ABI tries to get the best of both worlds (for code that doesn't need a lot of virtual address space): 32-bit pointers in long mode.

Just storing 32-bit array indices instead of pointers can work, if all the "pointers" are into the same pool that you allocate from. But beware that it can result in worse load/use latency because you (or the compiler) needs an indexed addressing mode, or a separate add instruction.
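A minimal sketch of that index-instead-of-pointer idea (names like `alloc` and `NIL` are made up for illustration): a linked list whose links are 4-byte indices into one pool instead of 8-byte pointers. Note that every traversal step `pool_val[idx]` is the indexed access the paragraph above warns about.

```python
import array

NIL = 0xFFFFFFFF                 # sentinel index, plays the role of NULL

pool_val  = array.array('q')     # node payloads (64-bit ints)
pool_next = array.array('I')     # 4-byte links instead of 8-byte pointers

def alloc(value, nxt=NIL):
    """Append a node to the pool; the returned 'pointer' is just its index."""
    pool_val.append(value)
    pool_next.append(nxt)
    return len(pool_val) - 1

head = alloc(1)          # list: 1
head = alloc(2, head)    # list: 2 -> 1
head = alloc(3, head)    # list: 3 -> 2 -> 1

def to_list(idx):
    out = []
    while idx != NIL:
        out.append(pool_val[idx])
        idx = pool_next[idx]     # "dereference": base + scaled index
    return out

print(to_list(head))             # [3, 2, 1]
print(pool_next.itemsize)        # 4: half the footprint of a 64-bit pointer
```

In C the same trick means replacing `struct node *next` with `uint32_t next`, halving the link footprint at the cost of an indexed addressing mode (or an extra add) on each dereference.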

There are tricks that JVMs (for example) use to "compact" pointers in 64-bit mode. https://wiki.openjdk.java.net/display/HotSpot/CompressedOops - some kinds of pointers are stored as 32-bit values that must be left-shifted by 3 for use, because they point to 8-byte-aligned heap objects. This allows addressing 32 GiB of space.
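The arithmetic behind compressed oops can be sketched in a few lines (`HEAP_BASE` is a hypothetical heap base address; real JVMs also have zero-base and other variants):

```python
HEAP_BASE = 0x7F00_0000_0000     # hypothetical heap base address

def compress(addr):
    assert addr % 8 == 0          # heap objects are 8-byte aligned
    oop = (addr - HEAP_BASE) >> 3 # the low 3 bits are always zero, so drop them
    assert oop < 2**32            # must fit in 32 bits -> 32 GiB of heap
    return oop

def decompress(oop):
    return HEAP_BASE + (oop << 3) # shift left by 3 and rebase

addr = HEAP_BASE + 0x12340
assert decompress(compress(addr)) == addr
print((2**32 * 8) // 2**30)       # 32: GiB reachable through a 32-bit oop
```

Because the shift amount equals the alignment (2³ = 8 bytes), no address bits are lost, and the reachable heap is 2³² × 8 bytes = 32 GiB.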

Peter Cordes
  • My impression was that initially a typical 64-bit binary was _slower_ than its 32-bit equivalent with the same code, by a few %. That is, the size issue overwhelmed the other minor issues. Also, the very first 64-bit chips may have had more instructions that were slower in 64-bit mode than those today. I don't know if that tells us much: a lot has changed since then: compilers have probably gotten better at optimizing 64-bit code relative to 32-bit, some code has changed to reduce the pointer size impact, things like SSE2 are more often used in 64-bit, etc. – BeeOnRope Jan 12 '18 at 21:52
  • @BeeOnRope: uop caches do a lot to help with slower decode from code-size from needing REX prefixes. Before that, decode / front-end bottlenecks were a big deal, and lower code density was probably part of the problem with 64-bit code being slower. And yeah, older AMD had slower 64-bit imul. So modern CPUs are certainly well optimized for 64-bit code, even modern coding style that uses `size_t` all over the place so 32-bit operand-size doesn't get used. – Peter Cordes Jan 06 '21 at 14:55
  • Still, modern CPUs hide a lot of the costs of 32-bit mode well, e.g. the extra store/reload from fewer regs and bad calling conventions, thanks to good store-forwarding. AMD Zen even has special zero-latency store forwarding when (IIRC) the same simple addressing mode is used for both store and reload. So IIRC it will work for store/reload of local vars, although maybe not arg-passing on the stack. – Peter Cordes Jan 06 '21 at 14:56