
I've noticed that the Conditional Move instruction is less extensible than the normal mov. For example, it doesn't support immediates and doesn't support the low-byte of a register.

Out of curiosity, why is the Cmov command much more restrictive than the general mov command? Why, for example, wouldn't both allow something like:

mov    $2, %rbx    # allowed
cmovcc $1, %rbx    # I suppose setcc %bl could be used for the '1' immediate case

As a side note, I've noticed when using Compiler Explorer that cmovcc is used much less than jcc and setcc. Is this normally the case, and if so, why is it used less frequently than the other conditionals?

  • I've recently been learning about the conditional select instruction on aarch64, which is interesting to compare. They don't have immediates either, but they do have variants that increment, complement, or negate one of the two values. Since they have a zero register, this effectively gives you almost every combination of 0, 1 and -1 as "immediates", which probably covers 98% of likely use cases. – Nate Eldredge Sep 27 '20 at 04:30
  • What did Intel say? I don't know that the original creators are still around or waiting for "why did you do that" questions, but you might get lucky. – old_timer Sep 27 '20 at 04:32
  • Note that the `mov` *instruction* (there are no *commands* in assembly, only *instructions*) is actually about a dozen different instructions that share the same name. That's why it appears to be so versatile. – fuz Sep 27 '20 at 15:38

1 Answer


Being conditional, it already needs 16 different opcodes just for the cmov r, r/m form, one for each different cc condition, just like jcc and setcc (synonyms share an opcode, of course).

So even if there were "room" for another 16 0F xx opcodes, it probably wouldn't have been worth spending all that coding space when Intel was adding cmov for Pentium Pro. Well, maybe for a sign-extended-imm8 form. That would have taken away room for other new opcodes, like the MMX and SSE instructions which Intel had probably already started to design, or at least think about, for Pentium-MMX and Pentium III when the ISA extensions for P6 were being finalized.

An imm8 form would be useful most of the time when you want a cmov at all (often to conditionally zero something), but it's not necessary. The RISC philosophy (which Intel was leaning into with P6¹) would favour only providing one way, and letting code use a mov-immediate to create a constant in another register if desired.
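As a concrete sketch of that "one way" (AT&T syntax; the registers and the condition are just for illustration, not from the question), the hypothetical cmovcc $1, %rbx becomes a mov-immediate into a scratch register plus a register-register cmov:

mov     $1, %eax        # materialize the immediate in a scratch register
                        # (writing %eax zero-extends into %rax)
test    %rcx, %rcx      # some condition that sets flags
cmovne  %rax, %rbx      # rbx = (rcx != 0) ? 1 : rbx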

Out-of-order exec can often hide the cost of mov-immediate to put a constant in another register. Such an instruction is independent of everything else and can execute as soon as there's a spare cycle on the execution port it's scheduled to. (However, the front-end is often a real bottleneck, and static code-size does matter, so it's unfortunately not free.)

Footnote 1: RISC ideas were a big thing for the P6 microarchitecture, most notably the revolutionary idea of decoding x86 instructions into 1 or more uops for its RISC-like back-end, allowing out-of-order exec of different parts of one memory-destination instruction (load / ALU / store), for example.

But also in smaller decisions: for example, P6 doesn't have hardware support for maintaining TLB coherence across the uops of one instruction. That's why adc %reg, (mem) needs more uops than you'd expect on Intel CPUs. Andy Glew (an Intel architect who worked on P6) explained that in Stack Overflow comments (which I quoted in this answer), including saying 'I was a RISC proponent when I joined P6, and my attitude was "let SW (microcode) do it".'

It's easy to see how this attitude could extend to x86 ISA design, and to only providing the bare minimum form of cmov. (An 8-bit form is hardly necessary; you can always move the whole register, and you often want to avoid partial registers in high-performance code anyway because of possible stalls, which were even more costly on PPro than on later P6 designs like Core 2. Sandybridge-family made partial-register merging even cheaper.)
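For instance, here's a sketch (register choices are illustrative, not from the original) of conditionally selecting a byte value without any 8-bit cmov form: zero-extend it into a full register and cmov the whole register.

movzbl  (%rsi), %eax    # zero-extend the byte into a full register
test    %ecx, %ecx      # some condition
cmovne  %eax, %edx      # select on the full register; writing %edx
                        # zero-extends into %rdx, no partial-reg stall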

But this is pure speculation on my part about what factors may have influenced that design decision.


The cost (in power and die area, and achievable clock speed) of adding transistors to decode an imm8, imm32, and/or r/m8 encoding of cmov would have to be weighed against the expected real-world speedup from code being able to use it. As well as against the future cost of using up more opcode coding space.

Other than conserving coding space for the future (which is what let MMX and SSE1 instructions have only 2-byte opcodes), Intel might have guessed wrong on this by omitting cmov $sign_extended_imm8, %reg, which would actually be useful fairly often.


It's used less because it's only useful when it's cheap to compute the result of both sides of a condition and select one, instead of just branching and only doing one. It's useful as an optimization, especially when a compiler expects that a branch would predict poorly. See Purpose of cmove instruction in x86 assembly?

More general cpu-architecture background about control dependencies (branching) vs. data dependencies (cmov): difference between conditional instructions (cmov) and jump instructions
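As a rough illustration of that difference (AT&T syntax; the C in the comments and the register assignments are assumptions for this sketch, not taken from any linked answer), the same C ternary can be lowered either way:

# C:  x = (a < b) ? a : b;    with a in %edi, b in %esi, result in %eax

# branchy: control dependency, cost depends on predictability
mov     %esi, %eax      # x = b
cmp     %esi, %edi
jge     1f              # if (a >= b), keep b
mov     %edi, %eax      # else x = a
1:

# branchless: data dependency, no possible mispredict
mov     %esi, %eax      # x = b
cmp     %esi, %edi
cmovl   %edi, %eax      # x = a if (a < b)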

See Conditional move (cmov) in GCC compiler re: when GCC does if-conversion into branchless asm.

Using cmov can even hurt if you do it wrong (gcc optimization flag -O3 makes code slower than -O2), for cases where branch prediction would have predicted pretty accurately (e.g. on the special case of sorted input data).

On older CPUs with shorter / narrower pipelines and smaller out-of-order execution resources (so the cost of a mispredict was lower), CMOV was useful in even fewer cases. Especially on Intel before Broadwell, where it takes 2 uops instead of 1. Linus Torvalds explained why it sucks for a lot of common cases, with some tests on a Core 2 CPU back in 2007: https://yarchive.net/comp/linux/cmov.html

It's certainly not rare to see compilers generate it, though, if you write code that selects from a couple values based on a condition. Clang's heuristics tend to favour using more cmov than GCC, i.e. more aggressive if-conversion to branchless.


Note that setcc doesn't get used a lot either, unless you frequently look at non-inlined versions of functions that return a boolean.
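Such a non-inlined boolean-returning function typically looks something like this sketch (x86-64 System V calling convention assumed; the function name and exact instruction choice are illustrative):

# C: bool is_equal(int a, int b) { return a == b; }
is_equal:
    xor     %eax, %eax      # zero the full register up front
                            # (setcc only writes the low byte)
    cmp     %esi, %edi
    sete    %al             # al = (a == b)
    ret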

I disassembled libperl.so on my Arch Linux desktop (just picked a random large binary), compiled by GCC 10.1.0. Out of 377835 total instructions (objdump -d | egrep ' +[0-9a-f]+:'| wc -l):

  • setcc appeared 1783 times, often in a setcc a / setcc b / or a,b sequence to do one branch on multiple conditions (sketched after this list).

  • cmovcc appeared 1737 times. objdump -drwC -Mintel /usr/lib/perl5/5.32/core_perl/CORE/libperl.so | egrep 'cmov[a-z]+ ' | wc
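Here's a sketch of that setcc / or pattern (the registers, the condition, and the handle label are all just illustrative):

# C:  if (a == 0 || b < 0) goto handle;
test    %edi, %edi
sete    %al             # al = (a == 0)
test    %esi, %esi
sets    %cl             # cl = (b < 0): sign flag set
or      %cl, %al        # combine both booleans
jnz     handle          # one branch for the whole condition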

  • I disagree about it not being useful; 98% of the time I use cmov, I need it with an immediate. – prl Sep 27 '20 at 07:07
  • @prl: I'd actually agree with that. It's a hassle to hoist constant setup out of a loop, if there is one, otherwise it's pure overhead. A sign-extended `imm8` form might have been worth the coding space. So maybe we should instead say it wasn't *necessary*, and Intel was trying to embrace RISC ideas with P6, sometimes to the detriment of the design. – Peter Cordes Sep 27 '20 at 07:10
  • I think the lack of available space in the opcode planes might have been the main problem. Note that you actually need 2x16 opcodes if you want an immediate form (8 bit immediate, 16/32 bit immediate). I mean they could have moved some bits into the second operand you don't need for an immediate-moving variant, but I guess it was ultimately rejected. – fuz Sep 27 '20 at 15:41
  • @fuz: I really did mean providing only an `imm8` form, with larger immediates using `mov` to a tmp reg. Many use cases involve small numbers. And indeed, you *could* use the `/r` field in modrm to encode 3 opcode bits like many immediate instructions, reducing it to 2 total opcodes, but note that `setcc` doesn't do that either. The decoders already know how to decode a 4-bit condition from the low bits of the opcode itself (for JCC), and with JCC and SETCC both working that way, it makes some sense to keep the condition-code bits in the same place. Even though it does burn more coding space. – Peter Cordes Sep 27 '20 at 16:25
  • @PeterCordes Oh yeah one thing to keep in mind is that if it was `cmov r/m32, imm8`, then `cmov` would always be a read/modify/write operation (unless they'd do some microcode trickery), making it fairly useless with memory operands. – fuz Sep 27 '20 at 16:27
  • @fuz: I was figuring that Intel would design it to `#UD` if `r/m32` encoded a memory destination, or just use whatever common-case uop generation handles generating the load and store, letting users shoot themselves in the foot if they want. The decoders already know how to decode opcode bits from `/r`, otherwise you'd use `/r` as the destination and have 5 total opcode bits between the Mode and r/m fields. IIRC, there is at least one other instruction that only allows a register as the r/m, but I forget which. (It might be more recent than P6, and maybe privileged?) Some only allow mem... – Peter Cordes Sep 27 '20 at 16:36
  • @PeterCordes Some x87 instructions are distinguished by addressing mode. For example, `d9 /5` is `fldcw m16` whereas a register operand generates various constants (`fld1`, `fldl2t`, `fldl2e`, `fldpi`, `fldlg2`, `fldln2`, and `fldz`). – fuz Sep 27 '20 at 17:12