I've tried the following C++ code:

    void foo( ) {
        char c = 'a';
        c = c + 1;
    }

With x86-64 gcc 10.1 and default flags I got the following:

    mov     BYTE PTR [rbp-1], 97
    movzx   eax, BYTE PTR [rbp-1]  ; EAX here
    add     eax, 1
    mov     BYTE PTR [rbp-1], al

But with x86 djgpp 7.2.0 (a 32-bit target, as the `ebp` in the output shows) and default flags I got:

    mov     BYTE PTR [ebp-1], 97
    mov     al, BYTE PTR [ebp-1] ; AL here
    inc     eax
    mov     BYTE PTR [ebp-1], al

Why does GCC use EAX instead of AL?

And why does djgpp use AL only?

Is it a performance issue?

If so, what kind of performance issue is behind using a 32-bit register for an 8-bit value?

Peter Cordes
No Name QA
  • This has nothing to do with C++. Your compiler "decides" on the assembly output. Which compiler do you use? And which flags are you setting while compiling? – RoQuOTriX Jul 09 '20 at 06:32
  • Did you try changing the compiler? – kesarling Jul 09 '20 at 06:33
  • @RoQuOTriX I've updated my question – No Name QA Jul 09 '20 at 06:35
  • Maybe also add the OS. And to talk about performance and optimization use the -O flags (-O3 e.g.) – RoQuOTriX Jul 09 '20 at 06:36
  • Looking at unoptimized output doesn't really teach you much. It's more about making every single statement produce the precise debug output one might expect. Turn on any kind of optimization and this entire routine disappears because it doesn't actually do anything. Kinda by definition unoptimized output never gets optimized for best performance. – David Wohlferd Jul 09 '20 at 06:37
  • @DavidWohlferd, Adding optimisation produces literally [nothing](https://godbolt.org/z/w0fFrR): – kesarling Jul 09 '20 at 06:39
  • What makes you think `mov eax, addr` is more expensive than `mov ax, addr` or `mov al, addr` ? It's a 32-bit bus (at least) and transferring less than that size (probably) doesn't save you anything. – selbie Jul 09 '20 at 06:39
  • @d4rk4ng31 I changed the compiler and saw that djgpp uses AL. Seems weird. Maybe you have any ideas why? – No Name QA Jul 09 '20 at 06:41
  • Also add `-O2` to your compiler command line. That function gets reduced to nothing. – selbie Jul 09 '20 at 06:41
  • @selbie `What makes you think mov eax, addr is more expensive than mov ax, addr`. I thought that the CPU needs to drop 24 bits to store a `char` in an 8-bit register, since we have, let's say, a 32-bit bus. – No Name QA Jul 09 '20 at 06:44
  • see https://stackoverflow.com/questions/46073295/implicit-type-promotion-rules, operations are supposed to be performed as integers – Alan Birtles Jul 09 '20 at 06:48
  • Using smaller registers often produces slower opcode. `mov ax,5` generates slower code than `mov eax,5` – Waqar Jul 09 '20 at 06:49
  • My other hypothesis is that `mov ax, addr` would leave garbage data in the remaining bits of that 32-bit register. Now imagine trying to debug through that when optimizations are turned off. – selbie Jul 09 '20 at 06:52
  • @selbie nice point, thank you! – No Name QA Jul 09 '20 at 06:54
  • GCC wants to avoid writing a partial register. A `movzx` load into a full register is like a byte-load on a RISC machine. `mov al, [mem]` is a merge. Of course it would make even more sense for a compiler to do `add byte ptr [rbp-1], 1` (in debug-mode where it chooses not to optimize away the whole thing). You'd also expect that if you took a `char*` arg and incremented the pointed-to memory. See [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) for tips on writing interesting small functions that don't optimize away. – Peter Cordes Jul 09 '20 at 19:42
  • Also, wow, that's very bad ASM from djgpp! Guaranteed partial-register stall on P6-family when `inc eax` reads EAX after the write of AL, unless there was an `xor eax,eax` earlier. It does save 1 byte of code-size over `inc al`, but a memory-destination `inc` would be much smaller. Of course this is un-optimized code, so *hopefully* it will never do this for real? OTOH clang is also occasionally sloppy with partial-registers, but mostly "just" taking false-dependency risks (on modern uarches) instead of causing partial-reg stalls on old CPUs. – Peter Cordes Jul 24 '20 at 23:20

2 Answers

On AMD and recent Intel processors, loading into a partial register requires the previous value of the whole register, because it has to be merged with the loaded value to produce the new register value.

If the full register is written, the old value is not required, so with register renaming the write does not have to wait for (and can even execute before) the previous write to that register.
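
The difference can be modelled in plain C++. This is only a sketch of the register semantics, not of how any CPU implements them; `write_low_byte` and `write_zero_extended` are hypothetical helper names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Models `mov al, [mem]`: only the low 8 bits change, so the result
// depends on the previous contents of the whole register (a merge).
constexpr std::uint32_t write_low_byte(std::uint32_t old_eax, std::uint8_t value) {
    return (old_eax & 0xFFFFFF00u) | value;
}

// Models `movzx eax, byte ptr [mem]`: the whole register is overwritten,
// so the previous contents are not an input at all.
constexpr std::uint32_t write_zero_extended(std::uint8_t value) {
    return value;
}

static_assert(write_low_byte(0x12345678u, 0x61) == 0x12345661u,
              "upper 24 bits come from the old value");
static_assert(write_zero_extended(0x61) == 0x00000061u,
              "upper 24 bits are always zero");
```

The first function takes the old register value as a parameter and the second does not, which is the dependency the paragraphs above describe.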

Timothy Baldwin
unsigned char fun ( unsigned char a, unsigned char b )
{
    return(a+b);
}

Disassembly of section .text:

0000000000000000 <fun>:
   0:   8d 04 3e                lea    (%rsi,%rdi,1),%eax
   3:   c3                      retq  

Disassembly of section .text:

00000000 <fun>:
   0:   e0800001    add r0, r0, r1
   4:   e20000ff    and r0, r0, #255    ; 0xff
   8:   e12fff1e    bx  lr


Disassembly of section .text:

00000000 <fun>:
   0:   1840        adds    r0, r0, r1
   2:   b2c0        uxtb    r0, r0
   4:   4770        bx  lr

Disassembly of section .text:

00000000 <fun>:
   0:   952e                    add x10,x10,x11
   2:   0ff57513            andi    x10,x10,255
   6:   8082                    ret

These are different targets (x86-64, ARM, Thumb and RISC-V, in that order), all compiled with gcc.

This is a compiler design choice, so the definitive answer has to come from the compiler authors rather than Stack Overflow. The compiler has to implement the high-level language functionally. All of these targets have GPRs that are at least 32 bits wide, so the choice is: do you mask after every operation (or at least before the value is left to be used later), do you assume the register is dirty and mask it before each use, or do you design around architectural features such as x86's ability to access eax in smaller parts (ax, al)? As long as the result is functionally correct, any of these solutions is perfectly fine.

One compiler may choose al for 8-bit operations and another may choose eax (which is likely more efficient from a performance perspective; there is plenty you can read on that topic). In both cases the compiler has to account for the remaining bits in the rax/eax/ax register and must not trip over them later by using the larger register.

Where partial register access is not available, you pretty much have to implement the code functionally, and the easy way is to mask. That matches the C code in this case. One could argue that the x86 code is buggy because it uses eax but doesn't clip, so the full register does not hold an unsigned char. However, a caller that expects an unsigned char reads only al, so the upper bits do not need to be clean.
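
As a rough illustration of the "mask" approach, the following C++ sketch adds in a full-width value and then clips to 8 bits, which is what `and r0, r0, #255` and `uxtb` do in the listings above. The helper names are hypothetical, and the sketch assumes an 8-bit `unsigned char`.

```cpp
#include <cassert>
#include <cstdint>

// Add in a 32-bit "register", then clip to 8 bits,
// like `add r0, r0, r1` followed by `and r0, r0, #255`.
constexpr std::uint32_t add_then_mask(std::uint32_t a, std::uint32_t b) {
    return (a + b) & 0xFFu;
}

// What the C source asks for: the int sum converted back to unsigned char.
constexpr unsigned char add_as_uchar(unsigned char a, unsigned char b) {
    return static_cast<unsigned char>(a + b);
}

static_assert(add_then_mask(0xFF, 0x01) == 0x00, "masked sum wraps to 0");
static_assert(add_as_uchar(0xFF, 0x01) == 0x00, "converted sum wraps to 0");
static_assert(add_then_mask(100, 50) == add_as_uchar(100, 50), "150 either way");
```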

Make it signed though:

signed char fun ( signed char a, signed char b )
{
    return(a+b);
}

Disassembly of section .text:

0000000000000000 <fun>:
   0:   8d 04 3e                lea    (%rsi,%rdi,1),%eax
   3:   c3                      retq  

Disassembly of section .text:

00000000 <fun>:
   0:   e0800001    add r0, r0, r1
   4:   e1a00c00    lsl r0, r0, #24
   8:   e1a00c40    asr r0, r0, #24
   c:   e12fff1e    bx  lr

Same story: one compiler design clearly handles the variable's size at some other point, while the other handles it right there and then.
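
The `lsl`/`asr` pair in the ARM output is a sign extension of the low byte. A minimal C++ sketch of the same idea follows; `sign_extend_low_byte` is a hypothetical name. It assumes that converting an out-of-range unsigned value to `std::int32_t` wraps and that `>>` on a negative value is an arithmetic shift, which is guaranteed from C++20 on and is what mainstream compilers do before that.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors `lsl r0, r0, #24` followed by `asr r0, r0, #24`.
constexpr std::int32_t sign_extend_low_byte(std::uint32_t reg) {
    // Shift left as unsigned (well defined), then shift right as signed
    // so that bit 7 of the original byte is copied into the upper bits.
    return static_cast<std::int32_t>(reg << 24) >> 24;
}

static_assert(sign_extend_low_byte(0x7Fu) == 127, "positive byte unchanged");
static_assert(sign_extend_low_byte(0xFEu) == -2, "0xFE is -2 as a signed byte");
static_assert(sign_extend_low_byte(127u + 1u) == -128, "127 + 1 wraps to -128");
```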

Force it to deal with the size in this function:

signed char fun ( signed char a, signed char b )
{
    if((a+b)>200) return(1);
    return(0);
}

Disassembly of section .text:

0000000000000000 <fun>:
   0:   40 0f be f6             movsbl %sil,%esi
   4:   40 0f be ff             movsbl %dil,%edi
   8:   01 f7                   add    %esi,%edi
   a:   81 ff c8 00 00 00       cmp    $0xc8,%edi
  10:   0f 9f c0                setg   %al
  13:   c3                      retq 

Disassembly of section .text:

00000000 <fun>:
   0:   e0800001    add r0, r0, r1
   4:   e35000c8    cmp r0, #200    ; 0xc8
   8:   d3a00000    movle   r0, #0
   c:   c3a00001    movgt   r0, #1
  10:   e12fff1e    bx  lr

Because the ARM design knows the values passed in are already clipped, and this is a greater-than comparison, they chose not to clip the sum, possibly because I left this as signed. In the x86 case, because they don't clip on the way out, they clip (sign-extend) on the way into the operation.

unsigned char fun ( unsigned char a, unsigned char b )
{
    if((a+b)>200) return(1);
    return(0);
}

Disassembly of section .text:

00000000 <fun>:
   0:   e0800001    add r0, r0, r1
   4:   e35000c8    cmp r0, #200    ; 0xc8
   8:   d3a00000    movle   r0, #0
   c:   c3a00001    movgt   r0, #1
  10:   e12fff1e    bx  lr

At first I would disagree with that: for example, 0xFF + 0x01 = 0x00 in 8 bits, and that is not greater than 200, but this code would pass it through as greater than 200. They also used the signed less-than and greater-than conditions on what looks like an unsigned compare.

unsigned char fun ( unsigned char a, unsigned char b )
{
    if(((unsigned char)(a+b))>200) return(1);
    return(0);
}

00000000 <fun>:
   0:   e0800001    add r0, r0, r1
   4:   e20000ff    and r0, r0, #255    ; 0xff
   8:   e35000c8    cmp r0, #200    ; 0xc8
   c:   93a00000    movls   r0, #0
  10:   83a00001    movhi   r0, #1
  14:   e12fff1e    bx  lr

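Ahh, there you go: it is the C integer promotion rules. `a+b` is computed as an `int`, so 0xFF + 0x01 is 256 unless it is cast back to `unsigned char`. (Just like `float f; f=f+1.0;` vs `f=f+1.0F;`.)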

And the cast changes the x86 results as well:

Disassembly of section .text:

0000000000000000 <fun>:
   0:   01 fe                   add    %edi,%esi
   2:   40 80 fe c8             cmp    $0xc8,%sil
   6:   0f 97 c0                seta   %al
   9:   c3                      retq 
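
The promotion behaviour behind the last two variants can be checked directly in C++. This is a minimal sketch with hypothetical function names, assuming an 8-bit `unsigned char` and a wider `int`.

```cpp
#include <cassert>
#include <type_traits>

// Adding two unsigned char values yields an int, not an unsigned char.
static_assert(std::is_same<decltype(static_cast<unsigned char>(0) +
                                    static_cast<unsigned char>(0)),
                           int>::value,
              "operands are promoted to int before the addition");

// Without a cast the comparison sees the full int sum.
constexpr bool sum_exceeds_200(unsigned char a, unsigned char b) {
    return (a + b) > 200;
}

// With the cast the sum is first wrapped back into 8 bits.
constexpr bool wrapped_sum_exceeds_200(unsigned char a, unsigned char b) {
    return static_cast<unsigned char>(a + b) > 200;
}

static_assert(sum_exceeds_200(0xFF, 0x01), "256 > 200");
static_assert(!wrapped_sum_exceeds_200(0xFF, 0x01), "0 is not > 200");
```

So the compare without the `and` is what the language asks for in the first case, and the masked compare is what it asks for once the cast is added.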

Why does GCC use EAX instead of AL?

And why does djgpp use AL only?

Is it a performance issue?

These are compiler design choices, not issues, and not necessarily about performance; they reflect the overall design of how the compiler implements the high-level language with the target's instruction set. Each compiler is free to do that however it wishes. There is no reason to expect gcc, clang, djgpp and others to make the same design choices, and no reason to expect gcc version x.x.x and y.y.y to make the same choices either. If you go far enough back, perhaps it was done differently, perhaps not; if it was, then the commit message or the developer group emails from that time may answer the "why" question.

halfer
old_timer