Why does gcc use mov %esi, %ecx for a shift count, when the arg is a 64 bit long?

Question

So I'm reading Computer Systems: A Programmer's Perspective 3rd ed. and for one of the practice problem solutions it says that the following function:

long shift(long x, long n)
{
    x <<= 4;
    x >>= n;
    return x;
}

compiles to assembly code, a portion of which is (with their comments):

# x in %rdi, n in %rsi
movq %rdi, %rax        # Get x
salq $4, %rax          # x <<= 4 
movl %esi, %ecx        # Get n (4 bytes) **QUESTION**
sarq %cl, %rax         # x >>= n

Assuming x86-64 linux based, with 'long' being quad-word size.

My question is about the line where I added 'QUESTION'. Why isn't that line 'movq %rsi, %rcx', since n is type 'long'? I don't think it's a typo since they specifically added 'Get n (4 bytes)'.

Only the low byte of `rcx` (`cl`) is used anyway. However, `movb %sil, %cl` could cause a partial register stall, while `movl %esi, %ecx` won't (AFAIK), since it implicitly clears the upper half of `rcx`. — Michael, Apr 27 '18 at 09:04
Sorry I have no idea what partial register stall means since I'm just learning (as you can tell), but I did think about `movb %sil, %cl`, and wondered why they didn't just do that. I understand that only the low byte (cl) is used, but I just didn't get why they chose specifically to take 4 bytes when n is 8 bytes. — , Apr 27 '18 at 09:20
And while the 8-bit `cl` is the count operand of `sar`, only the low 6 bits of `cl` are actually used for a 64-bit shift, i.e. for `n=0x12345678` the shift the CPU really performs is only `0x78 & 0x3F = 0x38 = 56` bits. For a 32-bit operand size only 5 bits are used, etc.: http://www.felixcloutier.com/x86/SAL:SAR:SHL:SHR.html ... so declaring `n` as a 64-bit long is already overkill at the C source level in this particular case. Which actually makes me wonder whether the C operator `>>` is UB for `n > 63`? ... As for why the full `rcx` is set (writing into `ecx` sets all of `rcx`): performance. Don't worry about it for now. — Ped7g, Apr 27 '18 at 09:20
@Ped7g I don't get what you said at the end there: "about why full rcx is set (writing into ecx sets rcx) = performance." I think writing into %ecx sets the higher 32 bits to 0. Is that what you're referring to? Is it better for performance or something? If that's the case I won't worry about it, I'd just like to know that there IS a reason. — , Apr 27 '18 at 09:24
But the short explanation is that a modern x86 CPU, when you do `mov ecx,1` `mov cl,1`, can write those two ones into two different physical internal registers (to allow parallel/out-of-order execution, since both write the same register but don't truly depend on each other), and merge them back into a single `rcx` later, usually on the first `rcx/ecx/cx` use (using just `cl` keeps using the second internal register). That merge is the "partial register stall" performance penalty. Setting the whole `rcx` creates no such split, so a single internal register represents all of `rcx`. — Ped7g, Apr 27 '18 at 09:25
And the `cl` in `sar` is taken from that `rcx`. Yes, setting `ecx` clears the upper 32 bits of `rcx`, so the CPU has the final `rcx` value as soon as `ecx` is written. OTOH `mov %si,%cx` (gas syntax; the `mov ecx,1` above is Intel syntax) would set only the low 16 bits of `rcx` and requires a merge with the upper 48 bits of the original `rcx` value. That merge is (or may be, on some CPU models; others may merge physically right at the `mov`) postponed until a use of `rcx/ecx` is found. — Ped7g, Apr 27 '18 at 09:28
I really just want to know why they only used 4 bytes instead of the whole 8 bytes. — , Apr 27 '18 at 09:29
Because the compiler knows that the count will be truncated to 6 bits anyway (by the `sar` instruction; I mean truncated only temporarily, for the shift, not written back into `cl` truncated), and `mov %esi, %ecx` has a shorter encoding (1 byte less) than `mov %rsi,%rcx`, so it has the same or better performance. (Shorter machine code clutters the instruction cache less, which leaves you more room in performance-critical code.) — Ped7g, Apr 27 '18 at 09:30
Why `movb %sil, %cl` is sub-optimal: it would need a REX prefix. It won't directly cause a partial-register stall, @Michael (because you're not later reading the rest of `%rcx`), but the `mov` itself does create a false dependency on some CPUs: [Why doesn't GCC use partial registers?](//stackoverflow.com/q/41573502). 32-bit `mov` is optimal because it's only 2 bytes, while `mov %rsi, %rcx` would also need a REX prefix. — Peter Cordes, Apr 27 '18 at 09:32
Possible duplicate: [The advantages of using 32bit registers/instructions in x86-64](//stackoverflow.com/q/38303333). I think those Q&As cover it, especially with the comments. We can reopen this if anyone thinks it needs its own answer. — Peter Cordes, Apr 27 '18 at 09:34
Take my performance explanations with a grain of salt; Peter Cordes knows better. The point is that the internal architecture of modern x86 CPUs sometimes leads the compiler to produce machine code that may not look optimal to a human, but the compiler often has very good reasons. The whole topic is quite advanced and finicky. If you are just starting with assembly, focus first on learning and understanding the basic x86 instructions, without worrying too much about the microarchitecture underneath or performance subtleties. Current CPUs are really complex machines and there's a lot going on under the hood. — Ped7g, Apr 27 '18 at 09:37
Thanks for the advice. It's just confusing for a beginner when the code seems to have random subtleties. I'll try not to get too caught up in it! — , Apr 27 '18 at 09:41
Well, you should know that `sar` will use only 6 bits of `n`; that should lead you to understand that the 4-byte variant behaves the same way as the 8-byte variant. That level of knowledge is required for basic assembly, and you should be aiming for it, so that you can first determine what the result will be. After that you can try to learn why that particular variant was selected by the compiler. In asm you always have many options, e.g. for `rcx = 0`: `mov $0, %ecx`, `mov $0, %rcx`, `xor %rcx, %rcx`, `xor %ecx, %ecx` (almost identical, except that `xor` affects flags). — Ped7g, Apr 27 '18 at 09:52
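
To make the point from the comments concrete, here is a minimal C sketch of why copying 4 bytes or 8 bytes of `n` gives the same shift. It is not what gcc does internally, just a model: the helper names (`sar64_model`, `count_from_64`, `count_from_32`, `count_mismatches`) are made up for illustration, and the masking with `0x3F` mirrors the documented behaviour of `sar` with a 64-bit operand. The test value `x` is kept non-negative because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of `sarq %cl, %rax`: for a 64-bit operand the CPU
   keeps only the low 6 bits of %cl. Masking also keeps the C shift
   well-defined, because the count is always in 0..63. */
static int64_t sar64_model(int64_t x, uint8_t cl)
{
    return x >> (cl & 0x3F);
}

/* Value of %cl if all 8 bytes of n were copied (movq %rsi, %rcx). */
static uint8_t count_from_64(int64_t n)
{
    return (uint8_t)(uint64_t)n;
}

/* Value of %cl if only the low 4 bytes were copied (movl %esi, %ecx). */
static uint8_t count_from_32(int64_t n)
{
    return (uint8_t)(uint32_t)(uint64_t)n;
}

/* Counts how many sample values of n give different results for the
   two ways of loading the shift count. */
static int count_mismatches(void)
{
    const int64_t x = 0x7000000000000000;   /* non-negative test value */
    const int64_t samples[] = {
        0, 1, 63, 64, 0x12345678, -1, INT64_MIN, 0x7FFFFFFF00000005
    };
    int mismatches = 0;

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        int64_t via64 = sar64_model(x, count_from_64(samples[i]));
        int64_t via32 = sar64_model(x, count_from_32(samples[i]));
        if (via64 != via32)
            mismatches++;
    }
    return mismatches;
}
```

Calling `count_mismatches()` from a `main` should report no differences, since the low byte of `n` is the same whether it is read from the full 8 bytes or from the low 4. Note that the C function in the question does no such masking, so this only models the instruction, not the C semantics of `x >>= n`.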
