Why does the assembly need a repeated movzx into eax?

Question

Code: the only difference is that one function takes signed short ints and the other takes unsigned short ints.

short signedShortIntSwap(short int* a , short int* b)
{
    short tmp = *a;
    *a = *b;
    *b = tmp;
    return *a;
}

unsigned short unsignedShortIntSwap(unsigned short int* a ,unsigned short int* b)
{
    unsigned short tmp = *a;
    *a = *b;
    *b = tmp;
    return *a;
}

Assembly (compiled with gcc -c -m64 -o func1 func1.c -O2 -fno-tree-vectorize):

0000000000000000 <signedShortIntSwap>:
   0:   f3 0f 1e fa             endbr64 
   4:   0f b7 07                movzx  eax,WORD PTR [rdi]
   7:   0f b7 16                movzx  edx,WORD PTR [rsi]
   a:   66 89 17                mov    WORD PTR [rdi],dx
   d:   66 89 06                mov    WORD PTR [rsi],ax
  10:   0f b7 07                movzx  eax,WORD PTR [rdi]
  13:   c3                      ret    
  14:   66 66 2e 0f 1f 84 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
  1b:   00 00 00 00 
  1f:   90                      nop

0000000000000020 <unsignedShortIntSwap>:
  20:   f3 0f 1e fa             endbr64 
  24:   0f b7 07                movzx  eax,WORD PTR [rdi]
  27:   0f b7 16                movzx  edx,WORD PTR [rsi]
  2a:   66 89 17                mov    WORD PTR [rdi],dx
  2d:   66 89 06                mov    WORD PTR [rsi],ax
  30:   0f b7 07                movzx  eax,WORD PTR [rdi]
  33:   c3                      ret    
  34:   66 66 2e 0f 1f 84 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
  3b:   00 00 00 00 
  3f:   90                      nop

Why is movzx eax,WORD PTR [rdi] repeated in each function, at addresses 4 and 10 and at addresses 24 and 30?
Why do the signed and unsigned functions compile to identical instructions? In which cases would they differ?

movzx is used because it avoids the false dependency you'd get from `mov ax, [rdi]`. Unlike writing a 32-bit register (which implicitly zero-extends to 64-bit), writing an 8- or 16-bit partial register logically merges with the old value of the full register. (Some CPUs do partial-register renaming to avoid that.) [Why doesn't GCC use partial registers?](https://stackoverflow.com/q/41573502) — Peter Cordes, Apr 26 '20 at 09:13
Oh, you're not asking why movzx, you're asking about reloading EAX after the store. That's because `a` and `b` might point to the same `short`, so `*b = tmp` might have modified `*a`. (A smarter compiler would have realized that if that's the case, both values loaded must be the same, because it can assume no data-race UB). I'll look for another duplicate and reopen if I don't find one. — Peter Cordes, Apr 26 '20 at 09:18
Related: [Why compilers no longer optimize this UB with strict aliasing](https://stackoverflow.com/q/34527729). Re: 2. the upper bits of the return value don't matter, the caller is required to ignore them, so it's good that GCC always uses `movzx` to avoid false dependencies because it's faster than `movsx` on some CPUs. — Peter Cordes, Apr 26 '20 at 09:30
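To make the second part of the question concrete, here is a minimal C sketch of where the signed and unsigned versions would be expected to differ: in a caller that widens the returned value. The helpers `widenSigned`, `widenUnsigned`, and `checkWidening` are hypothetical names added for illustration, and the sketch assumes a 16-bit `short` and 32-bit `int`, as on x86-64.

```c
#include <assert.h>

/* The two functions from the question, unchanged in behavior. */
short signedShortIntSwap(short int *a, short int *b)
{
    short tmp = *a;
    *a = *b;
    *b = tmp;
    return *a;
}

unsigned short unsignedShortIntSwap(unsigned short int *a, unsigned short int *b)
{
    unsigned short tmp = *a;
    *a = *b;
    *b = tmp;
    return *a;
}

/* Hypothetical callers: widening the result to int is where the
   signedness matters (sign-extension vs. zero-extension). */
int widenSigned(short *a, short *b)
{
    return signedShortIntSwap(a, b);
}

int widenUnsigned(unsigned short *a, unsigned short *b)
{
    return unsignedShortIntSwap(a, b);
}

/* Hypothetical self-check: the same 16-bit pattern 0xFFFF widens differently. */
static void checkWidening(void)
{
    short sa = 1, sb = -1;
    unsigned short ua = 1, ub = 0xFFFF;
    assert(widenSigned(&sa, &sb) == -1);
    assert(widenUnsigned(&ua, &ub) == 65535);
}
```

Inside the swap functions themselves only the low 16 bits are ever stored or returned, which is consistent with the identical machine code shown in the question.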
Take a look at how the output assembly changes if you add the `restrict` keyword to your parameters. — Joseph Sible-Reinstate Monica, Apr 26 '20 at 19:24
@JosephSible-ReinstateMonica: Yes, it removed that additional movzx statement. — InQusitive, Apr 27 '20 at 07:46
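For reference, a minimal sketch of the `restrict` variant discussed in the two comments above. The names `swapRestrict` and `checkSwapRestrict` are hypothetical; the original code with `restrict` added was not posted. Note that `restrict` is a promise made by the caller, so passing the same pointer for both arguments would be undefined behavior here.

```c
#include <assert.h>

/* Same swap as in the question, but with restrict-qualified pointers.
   The caller promises that a and b do not alias, so the compiler is
   free to assume the store to *b cannot change *a. */
short swapRestrict(short *restrict a, short *restrict b)
{
    short tmp = *a;
    *a = *b;
    *b = tmp;
    return *a;   /* no reload should be required: *a is known to hold the old *b */
}

/* Hypothetical self-check; only ever called with distinct objects. */
static void checkSwapRestrict(void)
{
    short x = 1, y = 2;
    short r = swapRestrict(&x, &y);
    assert(x == 2 && y == 1 && r == 2);
}
```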
@PeterCordes: When I removed the return statements from my C code, the additional movzx instructions went away. I might have overthought this earlier: we needed the repeated movzx because eax (the return value) has to be updated from the value it was holding to the new value, since the stored value has changed. If so, it has nothing to do with a and b both holding the same value, as you mentioned. What do you think? — InQusitive, Apr 27 '20 at 08:19
Well yeah, needing `*a` in AX as the return value is of course what it's doing there. But you'd hope that the compiler could trace the assignments through the swap and arrange for the initial loads to leave the right value in AX, despite the possibility of `a == b` (`*a == *b` is irrelevant of course). The fact that that doesn't happen is interesting, and I think a missed optimization. You could help the compiler by using a 2nd local tmp var and returning that, instead of reading `*a` after writing `*b`. — Peter Cordes, Apr 27 '20 at 08:33
Or if you just returned `*b` (which is the last thing written), the compiler could again arrange for that to still be in AX. Yup, GCC and clang both manage that: https://godbolt.org/z/h3Ijaw Also including a version that uses a tmp so the compiler doesn't need alias analysis. — Peter Cordes, Apr 27 '20 at 08:40
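The two rewrites suggested in the comments above could look roughly like this. This is a sketch under assumptions, not the code behind the Godbolt link; the names `swapReturnB`, `swapReturnTmp`, and `checkVariants` are hypothetical.

```c
#include <assert.h>

/* Variant 1: return *b, the last location written.
   Note this returns the OLD value of *a, unlike the question's function. */
short swapReturnB(short *a, short *b)
{
    short tmp = *a;
    *a = *b;
    *b = tmp;
    return *b;
}

/* Variant 2: keep both loaded values in locals and return the local,
   so nothing has to be read back from memory after the stores.
   Returns the same value as the question's function, including when a == b. */
short swapReturnTmp(short *a, short *b)
{
    short oldA = *a;
    short oldB = *b;
    *a = oldB;
    *b = oldA;
    return oldB;   /* the new value of *a */
}

/* Hypothetical self-check of both variants. */
static void checkVariants(void)
{
    short x = 1, y = 2;
    assert(swapReturnB(&x, &y) == 1);    /* old *a */
    assert(x == 2 && y == 1);
    assert(swapReturnTmp(&x, &y) == 1);  /* new *a */
    assert(x == 1 && y == 2);
}
```

Variant 2 is the one that preserves the original return value; variant 1 changes what the function returns and is only useful if the caller wants the old `*a`.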
Anyway, my reasoning for it being a missed optimization was: either `*a` and `*b` don't overlap so writing `*b` doesn't disturb `*a` (and thus the compiler knows what's still there). *Or* `a == b` so they fully overlap, and `*a == *b`. So the value loaded earlier as the initial `*b` value is still the correct `*a` value to return. It would be UB for `*a` and `*b` to *partially* overlap, because of strict aliasing and also that `alignof(short) == 2`. https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/ — Peter Cordes, Apr 27 '20 at 08:42
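The case analysis in the comment above can be checked at the C level with a small sketch. The helper `checkReturnsInitialB` is a hypothetical name; it only asserts that the question's function returns the value `*b` held before the call, both for distinct pointers and for `a == b`.

```c
#include <assert.h>

/* The signed function from the question, unchanged in behavior. */
short signedShortIntSwap(short int *a, short int *b)
{
    short tmp = *a;
    *a = *b;
    *b = tmp;
    return *a;
}

/* Hypothetical helper: asserts that the returned value equals the value
   that *b held before the call. This is the property that would let a
   compiler reuse the initially loaded *b instead of reloading *a. */
static void checkReturnsInitialB(short *a, short *b)
{
    short initialB = *b;
    short r = signedShortIntSwap(a, b);
    assert(r == initialB);
}
```

Calling it once with two distinct objects and once with the same pointer for both arguments covers the two defined cases; a partial overlap is not exercised, since the comment above argues that case is undefined behavior.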
