
I was reading a textbook and it has an exercise that asks you to write x86-64 assembly code based on this C code:

//Assume that the values of sp and dp are stored in registers %rdi and %rsi

int *sp;
char *dp;
*dp = (char) *sp;

and the answer is:

//first approach

movl (%rdi), %eax    //Read 4 bytes
movb %al, (%rsi)     //Store low-order byte

I can understand it, but I'm just wondering: can't we do something simpler in the first place, such as:

//second approach

movb (%rdi), %al    //Read one byte only rather than all four bytes
movb %al, (%rsi)     //Store low-order byte

Isn't the second approach more concise and straightforward compared to the first approach, which seems a little bit unnecessary since we only care about the low byte of *sp and aren't really interested in its upper 3 bytes?

  • You need to add 1 or 3 bytes to `%rdi` (depending on endianness) to get a pointer to the low-order byte of `*sp`. – Barmar Jul 08 '20 at 04:29
  • @Barmar let's say the machine is little-endian; then why do we need to add 1 byte? The address is the first byte by default, isn't it? –  Jul 08 '20 at 04:43
  • Your way will probably stall the store-to-load forwarder. – Raymond Chen Jul 08 '20 at 05:12
  • @Barmar: x86-64 is little-endian. The low byte of an `int` is the least significant. Both of the OP's versions are correct, but the 2nd one has a partial-register false dependency. – Peter Cordes Jul 08 '20 at 05:47
  • @RaymondChen: All (?) current CPUs can forward efficiently from a dword store to a byte reload of any of the individual bytes. The problem with a `movb` reload is writing a partial *register* (false dependency on some microarchitectures), so it would be more efficient to `movzbl (%rdi), %eax` then store AL. [Why doesn't GCC use partial registers?](https://stackoverflow.com/q/41573502). Or since we can assume that the last store to `(%rdi)` was dword or wider, yes a dword reload has maximum efficiency if you don't have a use for the `char` value sign or zero-extended into a register later. – Peter Cordes Jul 08 '20 at 05:51
  • @PeterCordes Thanks for the correction. The potential stall is on the partial register side, not the memory side. (I should have known better, having dealt with both problems in a CPU emulator.) – Raymond Chen Jul 08 '20 at 14:23
  • 1
    @slowjams What you did was rewrite the assignment to `*dp = *(char*)sp;`. The textbook is providing a literal translation of the original C code, without any optimizations, to avoid confusing the reader. Your rewrite is functionally equivalent (assuming the pointer doesn't straddle a page boundary), but has the risk of running afoul of subtle CPU performance issues that are probably beyond the scope of the chapter of the textbook you are reading. – Raymond Chen Jul 08 '20 at 14:30
  • @slowjams It was a mistake, I was thinking that the number was stored in pairs of 16-bit half-words. – Barmar Jul 08 '20 at 15:59

2 Answers


Yes, your byte-load way is correct but it's not actually more efficient on most CPUs.
TL:DR: Generally avoid writing to byte or 16-bit registers when you have equally convenient options that don't do that.

(And BTW, the suggestions you got in comments were both wrong: x86 is little-endian, and store-forwarding problems are very unlikely here (although perhaps possible on some older CPUs, so that one might not be totally wrong).)


Writing a partial register (narrower than 32-bit, so it doesn't implicitly zero-extend into the full register) has a false dependency on the old value on some microarchitectures. For example, movb (%rdi), %al decodes on Intel Haswell/Skylake as a micro-fused load + merge ALU operation. (See Why doesn't GCC use partial registers?; also, for Intel Haswell/Skylake specifically, this has a lot of detail.)
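
To make that false dependency concrete, here's a hedged illustration (the imul is just a stand-in for any earlier instruction that happens to write %rax):

imulq %rdx, %rax     // some earlier long-latency instruction that writes %rax
movb  (%rdi), %al    // load + merge into %rax: on Haswell/Skylake this can't complete until the imul result is ready
movb  %al, (%rsi)    // so the store in turn waits on that merge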

It would be more efficient to use movzbl (%rdi), %eax to do a zero-extending byte load instead.
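
For example, the byte-load version of the whole sequence without any partial-register write would look like:

movzbl (%rdi), %eax   // zero-extending byte load: writes all of %eax, so no merge with the old value
movb   %al, (%rsi)    // store low-order byte as before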

Or since we can assume that the last store to (%rdi) was dword or wider (so store-forwarding will be efficient if it's still in flight), it is actually most efficient to do a dword load with movl (%rdi), %eax. That avoids possible partial-register penalties, and has smaller machine-code size than movzbl (smaller is better, as a tie-break between otherwise equal options in terms of uops). Also, some old AMD CPUs run movzbl slightly less efficiently than a dword mov load (e.g. the zero-extension needs an ALU port).

(Most CPUs run movzbl "for free" in a load port, some also run movsbl sign-extension in a load port without needing any ALU port, notably Intel Sandybridge-family.)


Store forwarding is not a problem: all (?) current CPUs can forward efficiently from a dword store to a byte reload of any of the individual bytes, and definitely the low byte, especially when the dword store is aligned (like a C int will be). See https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/
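
As a quick illustration of the case being discussed (hypothetical: %ecx just stands in for whatever earlier code stored to *sp):

movl   %ecx, (%rdi)    // earlier dword store of the int, possibly still in the store buffer
movzbl (%rdi), %eax    // narrow reload of just the low byte: still forwards efficiently on current CPUs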

Of course, if you have a use for the char value sign- or zero-extended into a register later, load it that way.
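
For instance, a sketch assuming the surrounding code wants the char's value in a register afterwards:

movsbl (%rdi), %eax   // sign-extend the low byte of *sp into %eax, i.e. (int)(char)*sp
movb   %al, (%rsi)    // *dp = that byte; %eax stays available for later use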

Or even better, as @Ira points out, if you're optimizing this code along with something that stored to *sp, you can ideally just use whatever is in the register and optimize away the store/reload. (It's undefined behaviour in C for any other thread to asynchronously change that memory because it's int *, not volatile or _Atomic int*.)
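
A hypothetical sketch of that, assuming earlier code just computed the int value in %eax and stored it to *sp:

movl %eax, (%rdi)    // earlier code: *sp = value, which is still live in %eax
movb %al, (%rsi)     // *dp = (char)value: reuse the register, no reload of *sp at all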

Peter Cordes

(OP changed the question from a more general one with an example to a very specific one, which might explain why this answer looks funny with respect to the current question.)

The more general answer to your question is that for any operation in an HLL that you intend to compile to machine code, there are usually many ways to write machine instructions to do just that operation.

A good compiler will know of many of these variants. Its problem is to choose, for all the operations in your program, the generally more efficient variants for each operator, in such a way that they stitch together to achieve a working program. For instance, if one HLL operation is implemented so that it leaves its result in a register, and a successor HLL operation is supposed to use that result, then the compiler must choose implementations of the two operators such that the first leaves the value in a register and the second uses that same register as its input, or the program will not work.
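
As a concrete sketch using the question's own example (not output from any real compiler): for int t = *sp; *dp = (char) t; the implementations of the two operations have to agree on where t lives:

movl (%rdi), %eax    // implementation of "t = *sp" chosen to leave t in %eax
movb %al, (%rsi)     // implementation of "*dp = (char) t" chosen to read t from %al, the low byte of %eax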

When you consider that a real program consists of thousands of HLL operators, and their individual implementations must all be consistent, you can see that the compiler has a very complicated job making sure everything fits together and is reasonably efficient.

Ira Baxter