Both your code and @Rafael's answer are massively over-complicated.
You normally never want to use mov rdi, msg with a 64-bit immediate of the absolute address. (See "Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array".)
Use default rel and cmp byte [msg], 'H'. Or if you want the pointer in RDI so you can increment it in a loop, use lea rdi, [rel msg].
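As a rough sketch of both suggestions side by side, the snippet below first compares a byte straight from memory and then walks a pointer obtained with a RIP-relative LEA. The byte-counting loop, the exit-status choice, and the label names (.next, .found_end, .exit_zero, .exit) are illustrative assumptions, not something from the question; the string, the start entry point, and the macOS exit call number are taken from the code in this answer.

```nasm
default rel                     ; [label] now assembles as RIP-relative
global start

section .rodata
msg:    db "Hello, World!", 10, 0

section .text
start:
    ; One-off check: compare straight from memory, no pointer register needed.
    cmp     byte [msg], 'H'
    jne     .exit_zero

    ; Loop: get the address with a RIP-relative LEA, then walk the pointer.
    lea     rdi, [rel msg]      ; "rel" is redundant after default rel; kept to match the text
.next:
    cmp     byte [rdi], 0       ; reached the terminating 0 byte?
    je      .found_end
    inc     rdi                 ; advance to the next byte
    jmp     .next

.found_end:
    lea     rax, [rel msg]
    sub     rdi, rax            ; rdi = number of bytes before the terminator
    jmp     .exit

.exit_zero:
    xor     edi, edi            ; hypothetical choice: status 0 if the first byte wasn't 'H'
.exit:
    mov     eax, 0x2000001      ; macOS exit, same call number as used in this answer
    syscall
```

Neither path needs a 64-bit absolute address: the memory operand and both LEAs are encoded RIP-relative.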
The only thing that differs between your branches is the RDI value. You don't need to duplicate the RAX setup or the syscall; just get the right value in RDI and then have the branches rejoin each other. (Or do it branchlessly.)
@Rafael's answer still loads 8 bytes from the string for some reason, like both loads in your question. Presumably this is sys_exit, which ignores the upper bytes and sets the process exit status from only the low byte, but just for fun let's pretend we actually want all 8 bytes loaded for the syscall while only comparing the low byte.
    default rel                 ; use RIP-relative addressing modes by default for [label]
    global start

    section .rodata             ;; read-only data usually belongs in .rodata
    msg: db "Hello, World!", 10, 0

    section .text
    start:
        mov   rdi, [msg]        ; 8 byte load from a RIP-relative address
        mov   ecx, 'H'
        cmp   dil, cl           ; compare the low byte of RDI (dil) with the low byte of RCX (cl)
        jne   .notequal
        ;; fall through on equal
        mov   edi, 58
    .notequal:                  ; .labels are local labels in NASM
        ; mov rdi, [rdx]        ; still loaded from before; we didn't destroy it.
        mov   eax, 0x2000001
        syscall
Avoid writing to AH/BH/CH/DH when possible. It either has a false dependency on the old value of RAX/RBX/RCX/RDX, or it can cause partial-register merging stalls if you later read the full register. @Rafael's answer doesn't read the full register afterwards, but its mov ah, 'H' is dependent on the load into AL on some CPUs. See "Why doesn't GCC use partial registers?" and "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent". mov ah, 'H' has a false dependency on the old value of AH on Haswell/Skylake, even though AH is renamed separately from RAX. But AL isn't renamed separately, so yes, this might well have a false dependency on the load, stopping it from running in parallel and delaying the cmp by a cycle.
Anyway, the TL;DR here is that you shouldn't mess around with writing AH/BH/CH/DH if you don't need to. Reading them is often OK, but can have worse latency. And note that cmp dil, ah isn't encodable, because DIL is only accessible with a REX prefix and AH is only accessible without one.
I picked RCX instead of RSI because CL doesn't need a REX prefix, but since we need to look at the low byte of RDI (dil) we need a REX prefix on the cmp anyway. I could have used mov cl, 'H' to save code size, because there's probably no problem with a false dependency on the old value of RCX.
BTW, cmp dil, 'H' would work just as well as cmp dil, cl.
Or if we load the byte with zero-extension into the full RDI, we can use cmp edi, 'H' instead of the low-8 version of it. (Zero-extending loads are the normal / recommended way to deal with bytes and 16-bit integers on modern x86-64. Merging into the low byte of the old register value is usually worse for performance, which is the reason covered in "Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?".)
And instead of branching, we could CMOV. This is sometimes better, sometimes not, for code-size and performance.
Version 2, only actually loading 1 byte:
    start:
        movzx edi, byte [msg]   ; 1 byte load, zero extended to 4 (and implicitly to 8)
        mov   eax, 58           ; ASCII ':'
        cmp   edi, 'H'
        cmove edi, eax          ; edi = (edi == 'H') ? 58 : edi
        ; rdi = 58 or the first byte,
        ; unlike in the other version where it had 8 bytes of string data here
        mov   eax, 0x2000001
        syscall
(This version looks a lot shorter, but most of the extra lines were whitespace, comments, and labels. Optimizing to cmp-immediate makes this 4 instructions instead of 5 before the mov eax / syscall, but other than that they're equal.)