As usual for asm, there are various good ways to achieve what you want. The most important question is whether carry propagation between bytes is a possible problem or not.
Option 1 (simple add with carry propagation)
If you only care about the low 4 bytes of 64-bit RAX, you should probably just use EAX for 32-bit operand-size. (Writing a 32-bit register zero-extends into the full 64-bit register, unlike when you write to an 8 or 16-bit register.)
So, as mentioned in a comment, this does the trick for one interpretation of your question:
add eax, 0x01010101
If you really want every byte of RAX, that's 8 bytes. But only mov supports 64-bit immediates, not add. You can create a constant in another register:
mov rdx, 0x0101010101010101
add rax, rdx
The approach with a single wide add above has the disadvantage that an overflow in one byte propagates into the next higher one. So it's not really 4 or 8 independent byte adds, unless you know that each individual byte won't overflow and carry into the next byte (i.e. SWAR).

For example: if you have eax = 0x010101FF and add the constant from above, you will not get 0x02020200, but 0x02020300 (the least significant byte overflows into the second least significant one).
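If you do need independent per-byte wrapping but want to stay in general-purpose integer registers, a carry-less SWAR add is possible with masking. A minimal untested sketch of that classic technique (assuming the value is in RAX and that RCX/RDX/RSI are free as scratch registers):
mov rdx, 0x8080808080808080 ; mask of every byte's top bit
mov rcx, rax
and rcx, rdx ; save the top bit of each byte
xor rax, rcx ; clear the top bits: every byte is now <= 0x7F
mov rsi, 0x0101010101010101
add rax, rsi ; each per-byte sum fits in 8 bits, so no carry crosses a byte boundary
xor rax, rcx ; XOR the saved top bits back in (XOR is add-without-carry in bit 7, which is exactly what's needed at the byte boundary)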
Option 2 (loop without carry propagation)
Since you indicated that you want to use a loop to solve your problem, a possible approach which also only takes two registers is this:
[global func]
func:
mov rax, 0x4141414141414141
mov rcx, 8
.func_loop: ; NASM local .label is good style within a function
inc al ; modify low byte of RAX without affecting others
rol rax, 8
dec rcx
jne .func_loop
; RAX has been rotated 8 times, back to its original layout
ret
This will increment the least significant byte of rax (without affecting the other bits of rax), then rotate rax left by 8 bits, and repeat.
You could rotate by 16 bits (4 times) and do
inc ah ; doing AH first happens to be better with Skylake's partial-register handling: inc al can run in parallel with this once AH is already renamed separately
inc al
rol rax, 16
as the loop body. Modifying AH is usually worse for partial-register slowdowns than just modifying AL, although it should reduce loop overhead on CPUs like Ryzen that don't rename AH separately from RAX. (Fun fact: on Skylake this breaks even for latency, while inc al / inc ah in that order is slower: the inc ah can't start until after the inc al, because modern Intel CPUs don't rename the low-8 partial registers separately from the full register, only high-8.)
Note that the loop instruction is slow on Intel CPUs; it is functionally equivalent to this (but without modifying flags):
dec rcx
jne .func_loop
Also note that doing add al, 1 might on certain systems actually be slightly faster than doing inc al, as discussed here.
(Editor's note: rol with a count other than 1 only needs to modify CF, and inc/dec only modify the other flags (SPAZO). So with good partial-flag renaming, inc / rol / dec won't couple the inc/rol dependency chain into the dec loop-counter dependency chain and make this any slower than it needs to be. (Tested on Skylake: it does in fact run at 2 cycles / iteration throughput for large loop counts.) But dec would be a problem on Silvermont, where inc/dec do merge into FLAGS. Making one of them a sub or add would break the dependency chain through FLAGS.)
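For instance, a variant of the loop body above using add/sub as that note suggests (an untested sketch):
.func_loop:
add al, 1 ; add writes all flags, so no partial-flag merge penalty
rol rax, 8
sub rcx, 1 ; likewise sub instead of dec; jne reads the ZF written by the sub
jne .func_loop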
Option 3 (SIMD add without carry propagation)
Probably the most efficient way to achieve this overflow behavior is to use the dedicated SSE2 SIMD instruction:
default rel ; use RIP-relative addressing by default
section .rodata
align 16 ; without AVX, 16-byte memory operands must be aligned
vec1: times 8 db 0x01
dq 0
section .text
[global func]
func:
mov rax, 0x4141414141414141
movq xmm0, rax
paddb xmm0, [vec1] ; packed-integer add of byte elements
movq rax, xmm0
ret
This will move the value of rax to the lower half of xmm0, perform a byte-wise addition with the predefined constant (which is 128 bits long, but the upper 64 bits are irrelevant to us and thus zero), and then write the result back to rax.

The output is as expected: rax = 0x01010101010101FF yields 0x0202020202020200 (the least significant byte overflows).

Note that using a constant from memory would also be possible with integer add, instead of mov-immediate.
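For example (a sketch; the inc_const label is just an illustrative name, not from the original):
section .rodata
inc_const: dq 0x0101010101010101
section .text
add rax, [rel inc_const] ; 64-bit add with a memory source instead of mov-imm + add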
MMX would allow using only an 8-byte memory operand, but then you'd need EMMS before returning; the x86-64 System V ABI specifies that the FPU should be in x87 mode on call/ret.
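A hedged sketch of that MMX version (untested; the vec1_8 and func_mmx names are hypothetical):
section .rodata
vec1_8: times 8 db 0x01 ; only 8 bytes needed; MMX memory operands don't require 16-byte alignment
section .text
[global func_mmx]
func_mmx:
mov rax, 0x4141414141414141
movq mm0, rax ; MOVQ mm, r64 is available in 64-bit mode
paddb mm0, [rel vec1_8] ; byte-wise add with an 8-byte memory operand
movq rax, mm0
emms ; leave the FPU in x87 mode before returning, per the ABI
ret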
A trick you can use instead of loading a constant from memory is to generate it on the fly. It's efficient to generate an all-ones vector with pcmpeqd xmm1, xmm1. But how do you use that to add 1? SIMD right shift is only available for word (16-bit) or larger elements, so it would take a couple of instructions to transform all-ones into a vector of 0x0101... Or use SSSE3 pabsb.
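For the record, one way to do that transformation with SSE2 shifts (a sketch, not necessarily optimal):
pcmpeqd xmm1, xmm1 ; all-ones: every 16-bit word is 0xFFFF
psrlw xmm1, 15 ; logical right shift: every word becomes 0x0001
packsswb xmm1, xmm1 ; signed-saturate words to bytes: every byte is now 0x01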
The trick is that adding 1 is the same as subtracting -1, and all-ones is two's complement -1.
movq xmm0, rax
pcmpeqd xmm1, xmm1 ; set1( -1 )
psubb xmm0, xmm1 ; packed-integer sub of (-1) byte elements
movq rax, xmm0
Note that SSE2 also has instructions for saturating add and subtract: paddsb or psubsb for signed saturation, and paddusb or psubusb for unsigned. (For unsigned saturation you can't use the subtract -1 trick: psubusb with an all-ones vector would saturate every byte to 0 instead of producing the original value + 1.)
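For example, a saturating variant of the Option 3 code (a sketch, reusing the vec1 constant defined above):
movq xmm0, rax
paddusb xmm0, [vec1] ; unsigned saturating add: 0xFF + 1 stays 0xFF instead of wrapping to 0
movq rax, xmm0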