
I have the following snippet:

typedef struct {
        unsigned int gender: 1;
        unsigned int age:    7;
        unsigned int uid:    24;
} person;

int 
f(person *p) {
        return p->gender;
}

which generates (gcc -O2):

f:
        movzx   eax, BYTE PTR [rdi]           ; this line looks interesting.
        and     eax, 1
        ret
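
As a rough sketch of what those two instructions compute, the same logic can be written out by hand in C. This assumes GCC's bit-field layout on x86-64, where `gender` occupies bit 0 of the first byte (bit-field layout is implementation-defined, so this is not portable). The names `f_by_hand` and `check_layout` are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
        unsigned int gender: 1;
        unsigned int age:    7;
        unsigned int uid:    24;
} person;

/* The original function, for comparison. */
int f(const person *p) {
        return p->gender;
}

/* Hand-written equivalent of the generated code.
 * ASSUMPTION: gender is bit 0 of the first byte (GCC on x86-64). */
int f_by_hand(const person *p) {
        /* movzx eax, BYTE PTR [rdi]: load one byte, zero-extend to 32 bits */
        uint32_t eax = *(const unsigned char *)p;
        /* and eax, 1: keep only bit 0, which holds gender */
        return (int)(eax & 1u);
}

/* Checks the layout assumption: with only gender set,
 * the first byte of the struct should be exactly 0x01. */
void check_layout(void) {
        person p = { .gender = 1, .age = 0, .uid = 0 };
        assert(*(const unsigned char *)&p == 1);
}
```

If the layout assumption holds, `f_by_hand(&p)` and `f(&p)` return the same value for any `person`, because `age` shares the loaded byte but is masked away by the AND.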

Questions

  • If I understand correctly, MOVZX zero-extends the loaded byte, filling the upper bits of the destination register with 0s. The very next instruction ANDs the value with 1. Does MOVZX add anything here? Wouldn't a plain MOV work just as well, since I only need the lowest bit?
  • I couldn't find a speed comparison between MOV and MOVZX. Does one perform better than the other? (I'm assuming MOVZX is slower because it has to clear the upper bits, if I'm not mistaken.)
  • Does this answer your question? [Compiler generates costly MOVZX instruction](https://stackoverflow.com/questions/43491737/compiler-generates-costly-movzx-instruction) (to be clear, title is wrong, `movzx` isn't actually costly, it's reading from memory that costs something). – ShadowRanger Apr 06 '21 at 16:38
  • @ShadowRanger I don't know. Can you explain why it is there in my example? I couldn't understand it from the given link. I found that link before asking this question, but my knowledge in this area is extremely limited. That's why I am asking. –  Apr 06 '21 at 16:40
  • [This answer (second ranked at time of writing)](https://stackoverflow.com/a/43910889/364696) fully answers your question as to why the compiler is using `movzx`; it removes partial register dependencies, so if stuff was left in other parts of `eax` it's not forced to wait on them. If it didn't do that, `and` would have to wait for the other parts of `eax` to "finalize" before it could do its work. `movzx` actually has negative cost; the cost of zero-extension is zero, and if it breaks a dependency, allowing the `and` to run immediately when it otherwise wouldn't, it actually speeds things up. – ShadowRanger Apr 06 '21 at 16:42
  • `mov` and `movzx` loads are the same speed on modern CPUs. Zeroing the high bits happens right in the load port, with no separate ALU uop needed; see https://uops.info/ and https://agner.org/optimize/ – Peter Cordes Apr 06 '21 at 16:43
  • @ShadowRanger: `mov eax, dword ptr [rdi]` would also be legal here, and avoid partial *register* problems. But that could give you a store-forwarding stall depending on recent stores, so I'd guess GCC is loading the smallest possible chunk that fully contains the bitfield you want. (And yes, a zero-extending load is generally the best way to do that, and what GCC always chooses, despite costing an extra byte of code-size vs. byte or dword `mov`.) – Peter Cordes Apr 06 '21 at 16:46
  • After reading the comments and the linked answers several times, I now understand the concept. Thank you all. –  Apr 06 '21 at 17:11
  • [Why doesn't GCC use partial registers?](https://stackoverflow.com/a/41574531) is also a good answer for why GCC chooses movzx instead of a byte `mov`, in case that's what you were wondering. For details on its performance, see [Any way to move 2 bytes in 32-bit x86 using MOV without causing a mode switch or cpu stall?](https://stackoverflow.com/q/13092829) (the premise of the question is wrong, there's never a "mode switch" and it's not slow on CPUs after P5 Pentium) – Peter Cordes Apr 06 '21 at 19:01
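
To make the comments above concrete, here is a minimal sketch that models the register contents after each of the three candidate loads. It only illustrates the architectural result (all three give the same low bit after the AND); it says nothing about timing, which is where the partial-register dependency and store-forwarding concerns come in. The function names are hypothetical, and the dword comparison assumes a little-endian target such as x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of `movzx eax, BYTE PTR [mem]`:
 * the result depends only on the loaded byte. */
uint32_t load_movzx(const void *mem) {
        unsigned char b;
        memcpy(&b, mem, 1);
        return b;
}

/* Model of `mov al, BYTE PTR [mem]`:
 * the low 8 bits are replaced, the upper 24 bits keep the old
 * value of eax, so the result depends on old_eax as well. */
uint32_t load_byte_mov(uint32_t old_eax, const void *mem) {
        unsigned char b;
        memcpy(&b, mem, 1);
        return (old_eax & 0xFFFFFF00u) | b;
}

/* Model of `mov eax, DWORD PTR [mem]`:
 * all 32 bits are replaced; mem must have 4 readable bytes. */
uint32_t load_dword_mov(const void *mem) {
        uint32_t d;
        memcpy(&d, mem, sizeof d);
        return d;
}

/* After `and eax, 1`, all three variants agree on the result.
 * ASSUMPTION: little-endian, so byte 0 is the low byte of the dword. */
void check_low_bit_agrees(const void *mem, uint32_t stale_eax) {
        uint32_t expected = load_movzx(mem) & 1u;
        assert((load_byte_mov(stale_eax, mem) & 1u) == expected);
        assert((load_dword_mov(mem) & 1u) == expected);
}
```

The extra `old_eax` parameter of `load_byte_mov` is the point: a byte `mov` merges into the previous register value, which is the dependency that `movzx` avoids, even though the final masked value is identical.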
