
Is there any execution timing difference between 8-bit and 64-bit instructions on a 64-bit x64/AMD64 processor, when those instructions are similar/the same except for bit width? Is there a way to find the real processor timing of executing these two tiny assembly functions?

-Thanks.

# 64 bit instructions
add64:
     mov  $0x1, %rax    # writes all of RAX
     add  $0x2, %rax
     ret

# 8 bit instructions
add8:
     mov  $0x1, %al     # writes only the low byte of RAX
     add  $0x2, %al
     ret

  • Processor timing is non-deterministic: it is pipelined, in this case microcoded, etc. Any documented timing you might find is under specific ideal situations which you won't really see for more than n instructions in a row... In this case, though, the buses are wide enough that it shouldn't make a difference, but the 8-bit version might be slower due to masking – old_timer Dec 18 '20 at 14:58
  • In theory, modern processors can do 64-bit addition as fast as they can do 8-bit addition. However, timing is heavily dependent on the context a group of instructions finds itself in -- in the case of these examples, the instructions of the calling function would also come into play, and results could vary widely in practice. The masking old_timer refers to is that, from an architectural perspective, the 8-bit operation requires the processor to keep the upper 7 bytes of the old `rax` and combine them with the 8-bit answer, whereas the 64-bit operation produces an entirely new `rax`. – Erik Eidt Dec 18 '20 at 16:43
  • As a very rough rule of thumb, you can expect that "simple" instructions (mov, add, sub, bitwise) will be the same, while "complex" instructions (multiply, divide, etc.) may be slower for larger operands. A useful resource for x86 instruction timings is https://uops.info. – Nate Eldredge Dec 18 '20 at 17:29
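
A sketch of the partial-register merging the comments above describe (the constants are just for illustration):

     # an 8-bit write merges into the old 64-bit value:
     #   RAX = (old RAX & ~0xFF) | (result & 0xFF)
     mov  $0x1, %al      # depends on whatever last wrote RAX
     # a 32-bit write replaces the whole register, zero-extending:
     #   RAX = zero_extend(result)
     mov  $0x1, %eax     # no dependency on the old RAX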

1 Answer


Yes, there's a difference. mov $0x1, %al has a false dependency on the old value of RAX on most CPUs, including everything newer than Sandybridge. It's a 2-input 1-output instruction; from the CPU's point of view it's like add $1, %al as far as scheduling it independently or not relative to other uses of RAX. Only writing a 32 or 64-bit register starts a new dependency chain.

This means the AL return value of your add8 function might not be ready until after a cache miss for some independent work the caller happened to be doing in EAX before the call, but the RAX result of add64 could be ready right away for out-of-order execution to get started on later instructions in the caller that use the return value. (Assuming their other inputs are also ready.)
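
As a sketch, consider a hypothetical caller that leaves a cache-missing load result in EAX before the call (the load is made up for illustration):

     mov  (%rdi), %eax   # load that may miss in cache; writes RAX
     call add8           # mov $0x1,%al merges into that RAX, so the AL
                         # return value waits for the cache miss

     mov  (%rdi), %eax   # same cache-missing load
     call add64          # mov $0x1,%rax starts a fresh dep chain; the
                         # RAX result is ready without waiting for the load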

Their code-size also differs: both of the 8-bit instructions are 2 bytes long (thanks to the AL, imm8 short-form encoding; add $1, %dl would be 3 bytes), while the RAX instructions are 7 and 4 bytes long. This matters for L1i cache footprint (and on a large scale, for how many bytes have to get paged in from disk). On a small scale, it affects how many instructions can fit into a 16 or 32-byte fetch block if the CPU is doing legacy decode because the code wasn't already hot in the uop cache. Code-alignment of later instructions is also affected by the varying lengths of previous instructions, sometimes affecting which branches alias each other.
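
As a concrete sketch, the machine-code bytes for these four instructions (as a disassembly listing would show them):

     b0 01                  mov  $0x1, %al    # 2 bytes: AL, imm8 short form
     04 02                  add  $0x2, %al    # 2 bytes: AL, imm8 short form
     48 c7 c0 01 00 00 00   mov  $0x1, %rax   # 7 bytes: REX.W + opcode + ModRM + imm32
     48 83 c0 02            add  $0x2, %rax   # 4 bytes: REX.W + sign-extended imm8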

https://agner.org/optimize/ explains the details of the pipelines of various x86 microarchitectures, including front-end decoding effects that can make instruction-length matter beyond just code density in the I-cache / uop-cache.

Generally 32-bit operand-size is the most efficient (best for performance, and pretty good for code-size). 32 and 8 are the operand-sizes x86-64 can use without extra prefixes, but in practice with 8-bit you need more or longer instructions to avoid stalls and other badness, because 8-bit writes don't zero-extend into the full register. See The advantages of using 32bit registers/instructions in x86-64.
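
For comparison, the 32-bit versions of the same two instructions, which zero-extend into RAX and need no prefixes:

     b8 01 00 00 00         mov  $0x1, %eax   # 5 bytes: no REX, no ModRM
     83 c0 02               add  $0x2, %eax   # 3 bytes: sign-extended imm8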

A few instructions are actually slower in the ALUs for 64-bit operand-size, not just front-end effects. That includes div on most CPUs, and imul on some older CPUs. Also popcnt and bswap. e.g. Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux
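
For instance, a sketch of the two divide forms (the register choices here are arbitrary):

     xor  %edx, %edx     # zero the high half of the dividend
     div  %ecx           # 32-bit divide: EDX:EAX / ECX

     xor  %edx, %edx
     div  %rcx           # 64-bit divide: RDX:RAX / RCX; many more uops and
                         # higher latency on most Intel CPUs before Ice Lake,
                         # even when the actual values are small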

Note that mov $0x1, %rax will assemble to 7 bytes with GAS, unless you use as -O2 (not the same as gcc -O2; see this for examples) to get it to optimize to mov $1, %eax, which has exactly the same architectural effects but is shorter (no REX or ModRM byte). Some assemblers do that optimization by default, but GAS doesn't. Why NASM on Linux changes registers in x86_64 assembly has more about why this optimization is safe and good, and why you should do it yourself in the source, especially if your assembler doesn't do it for you.


But other than the false dep and code-size, they're the same for the back-end of the CPU: all those instructions are single-uop and can run on any scalar-integer ALU execution port1. (https://uops.info/ has automated test results for every form of every unprivileged instruction).

Footnote 1: Excavator (last-gen Bulldozer-family) can also run mov $imm, %reg on 2 more ports (AGU) for 32 and 64-bit operand-size. But merging a new low-8 or low-16 into a full register needs an ALU port. So mov $1, %rax has 4/clock throughput on Excavator, but mov $1, %al only has 2/clock throughput. (And of course only if you use a few different destination registers, not actually AL repeatedly; that would be a latency bottleneck of 1/clock because of the false dependency from writing a partial register on that microarchitecture.)
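
To illustrate that last point, a sketch of a throughput test versus the latency-bound version:

     # throughput: independent destination registers, no chain
     mov  $0x1, %al
     mov  $0x1, %bl
     mov  $0x1, %cl
     mov  $0x1, %dl

     # latency-bound: each write merges into the previous RAX value, so this
     # runs at 1/clock on Excavator despite the 2/clock port throughput
     mov  $0x1, %al
     mov  $0x1, %al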

Previous Bulldozer-family CPUs starting with Piledriver can run mov reg, reg (for r32 or r64) on EX0, EX1, AGU0, AGU1, while most ALU instructions including mov $imm, %reg can only run on EX0/1. Further extending the AGU port's capabilities to also handle mov-immediate was a new feature in Excavator.

Fortunately Bulldozer was obsoleted by AMD's much better Zen architecture which has 4 full scalar integer ALU ports / execution units. (And a wider front end and a uop cache, good caches, and generally doesn't suck in a lot of the ways that Bulldozer sucked.)


Is there a way to measure it?

Yes, but generally not in a function you call with call. Instead, put it in an unrolled loop so you can run it lots of times with minimal other instructions. It's especially useful to look at CPU performance-counter results to find front-end / back-end uop counts, as well as the overall time for your loop.

You can construct your loop to measure latency or throughput; see RDTSCP in NASM always returns the same value (timing a single instruction).
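
A minimal sketch of such a loop (the label and iteration count are made up for illustration; you'd time it with something like perf stat, or RDTSC as in the linked Q&A):

.globl test_loop
test_loop:
     mov  $100000000, %ecx   # hypothetical iteration count
1:
     mov  $0x1, %eax         # instructions under test, unrolled a couple of
     add  $0x2, %eax         # times to amortize the dec/jnz loop overhead
     mov  $0x1, %eax
     add  $0x2, %eax
     dec  %ecx
     jnz  1b
     ret

This version measures throughput, because each mov $0x1, %eax breaks the dependency chain; to measure latency instead, make each instruction depend on the previous one (e.g. keep adding into the same register without the dep-breaking mov).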

Generally you don't need to measure yourself (although it's good to understand how; that helps you know what the measurements really mean). People have already done that for most CPU microarchitectures. You can predict performance for a specific CPU for some loops (if you can assume no stalls or cache misses) based on analyzing the instructions. Often that can predict performance fairly accurately, but medium-length dependency chains that OoO exec can only partially hide make it too hard to accurately predict or account for every cycle.

Peter Cordes