Understanding partial-register slowdowns from mov instead of movzx instruction

Question

I'm very new to assembly language and trying to understand some of its working principles. I read this answer and have a question about the following wording:

it avoids performance penalties that may result from writing to only the low 8 (or 16) bits of a register

What kind of performance penalty? It is very unclear to me why writing just the low 8 or 16 bits of a register would be slower than writing the sign-extended version.

I would guess there's a duplicate already and you didn't search hard enough, so just a short answer as a comment: a modern x86 CPU doesn't implement the ISA directly in transistors the way the 8086 through 80386 did (maybe even the 486). A modern x86 CPU has a completely different internal architecture, some with 100+ internal physical registers. To avoid stalling on false dependencies, the CPU can alias/rename the official public register names onto its internal physical registers on the fly. This comes at a price: `al` and `eax` are often two different physical registers, which must then be merged when `eax` is used. — Ped7g, Nov 01 '17 at 10:19
If you are working through the principles, you shouldn't bother with this too much: it is an almost-unobservable implementation detail of a particular x86 CPU and may change with future models, whereas as long as the x86 ISA is used, `al` is the bottom 8 bits of `ax/eax/rax` (from the asm programmer's view). Once you are comfortable with the principles and basics, you may get distracted by performance-related rules, where the particular implementation of an x86 CPU model is "leaking" through various performance characteristics (so it *is* observable through performance). — Ped7g, Nov 01 '17 at 10:22
See also performance-tuning / optimization links in the [x86 tag wiki](https://stackoverflow.com/tags/x86/info). (Some of my high-voted answers are worth reading, too; I've already linked some of the good article-style ones in the tag wiki.) Anyway, good question, it just turns out that it's already answered on SO and in the good asm performance guides. Intel Nehalem and earlier have partial-register *stalls* when reading a wider register after writing a narrow one; now it's more like just a false dependency in most cases. — Peter Cordes, Nov 01 '17 at 10:33
Do those duplicates answer your question sufficiently? I don't think this needs a separate answer, but then I'm not the beginner who doesn't already know the answer :P Let me know if those aren't exact enough duplicates and you think SO could use a simple answer to just this question. (Although IDK how simple it could be; it's basically about making out-of-order execution work well by avoiding dependencies one way or another, and writing 8 or 16 bit regs merges instead of throwing away the old value.) — Peter Cordes, Nov 01 '17 at 10:42
@PeterCordes The only thing I still do not understand is how dependencies come into it with 16-bit operands. So we actually copy the content of the 16-bit part into another register and thereby introduce the dependency? As specified in one of the answers you referenced, we don't have this issue with 64-bit registers because writing the 32-bit part zero-extends. — St.Antario, Nov 01 '17 at 13:01
Here's an example. `mov eax, [mem_that_cache_misses]` / `push eax` / now you want to use AX for something unrelated. If you use a `movzx` load to write AX and zero-extend into the rest of the register, it can happen in parallel with waiting for the cache miss. If you use `mov AX, something`, it has to wait for the cache miss before the new dep chain involving AX can start. (The false dependency couples them together and defeats out-of-order execution). Or on older Intel CPUs, AX will be renamed separately from EAX, but you'll get a stall if you read EAX while AX is renamed separately. — Peter Cordes, Nov 01 '17 at 13:25
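
The sequence described in the last comment could be sketched roughly as follows. This is a minimal illustration in NASM syntax for 32-bit mode, not a complete program and not a benchmark. The labels `slow_mem` and `word_val` are hypothetical names introduced here, and the timing remarks in the comments only restate what the comments above say; actual behaviour depends on the CPU model.

```nasm
; Fragment only (NASM syntax, 32-bit mode).
; slow_mem and word_val are hypothetical labels, assumed to be defined elsewhere.

; Variant A: 16-bit mov, which merges into the old value of EAX
    mov   eax, [slow_mem]       ; load that is assumed to miss in cache
    push  eax                   ; consumer of the slow load
    mov   ax, [word_val]        ; writes only AX; the upper 16 bits of EAX are kept,
                                ; so the result depends on the old EAX. On CPUs that
                                ; do not rename AX separately, this waits for the load.
    add   ax, 1                 ; "unrelated" work, now tied to the cache miss

; Variant B: movzx, which overwrites the whole register
    mov   eax, [slow_mem]       ; same slow load
    push  eax                   ; same consumer
    movzx eax, word [word_val]  ; writes all 32 bits of EAX (upper 16 bits zeroed),
                                ; so there is no dependency on the old EAX
    add   eax, 1                ; new dependency chain can run during the cache miss
```

Architecturally both variants leave the same value in AX. The difference is only in whether the new value of the register depends on the old one, which is what out-of-order execution cares about.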
