
I have recently started to learn assembly and have set up a small project for myself. The goal is to use loops: I want to move 0x414141 into RAX, then loop over RAX and increment every byte, so that RAX contains 0x424242 at the end of the code.

I have tried incrementing byte rax, but I always get errors from NASM when trying to assemble it. Currently I have working code that, in the end, increments RAX to 0x414144. I can't seem to find anything that looks/sounds close to what I want to do. (But how hard can it be, right?)

global _start

section .text
_start:
    mov rax, 0x414141
    mov rcx, 3          ; loop counter: 3 iterations
strLoop:
    inc rax             ; increments RAX as one 64-bit value, not per byte
    loop strLoop

    mov rax, 60         ; sys_exit
    mov rdi, 0          ; exit status 0
    syscall

When I look at RAX in GDB after running this code, it is 0x414144, as I would expect. However, I want to get my code to the point where RAX ends up as 0x424242, which is the expected result in this project.

    Try adding 0x010101 to RAX (no loop needed)? – Michael Petch Aug 25 '19 at 17:40
    @MichaelPetch - I guess it depends what overflow semantics they want :). – BeeOnRope Aug 25 '19 at 19:52
  • FYI, generally speaking, the registers can be named in instructions, but cannot be indexed -- you can't take the address of a register; you can't access a register indirectly. Memory can be indexed, have its address taken, accessed indirectly, so if you want to do indexing, it pretty much has to go in memory. For example, we can write a loop to increment each byte of an array in memory. – Erik Eidt Aug 25 '19 at 19:58
  • @BeeOnRope : true, somehow I had it in my head (although clearly not mentioned at all in the question) that they were dealing with letters. I guess when I saw 0x41 I was thinking ASCII strings lol. – Michael Petch Aug 25 '19 at 20:20
  • @ErikEidt: You *can* use a variable-count rotate by a multiple of 8 bits to sort of simulate indexing a register. Lower latency but probably more uops than store / RMW a byte / reload the whole thing with a store-forwarding stall. And BTW, Jan's answer on this deserves more upvotes. IDK why it got a downvote. – Peter Cordes Aug 26 '19 at 08:33

1 Answer


As usual for asm, there are various good ways to achieve what you want. The most important question is whether carry propagation between bytes is a possible problem or not.


Option 1 (simple add with carry propagation)

If you only care about the low 4 bytes of 64-bit RAX, you should probably just use EAX for 32-bit operand-size. (Writing a 32-bit register zero-extends into the full 64-bit register, unlike when you write to an 8 or 16-bit register.)

So, as mentioned in a comment, this does the trick for one interpretation of your question:

 add   eax, 0x010101
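
For instance, the zero-extension means stale upper bits can't survive a write to EAX; a quick sequence you can single-step in GDB to see it (a sketch):

 mov   rax, -1            ; rax = 0xFFFFFFFFFFFFFFFF
 mov   eax, 0x414141      ; writing EAX zero-extends: rax = 0x0000000000414141
 add   eax, 0x010101      ; rax = 0x0000000000424242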

If you really want every byte of RAX, that's 8 bytes. But only mov supports 64-bit immediates, not add. You can create a constant in another register:

 mov   rdx, 0x0101010101010101
 add   rax, rdx
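
Applied to an 8-byte version of your value, this gives (assuming for now that no individual byte overflows):

 mov   rax, 0x4141414141414141
 mov   rdx, 0x0101010101010101
 add   rax, rdx           ; rax = 0x4242424242424242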

The approach with a single wide add above has the disadvantage that an overflow in one byte propagates into the next higher one. So it's not really 4 or 8 independent byte adds, unless you know that each individual byte won't overflow and carry into the next byte (i.e. SWAR: SIMD within a register).

For example: If you have eax = 0x010101FF and add the constant from above, you will not get 0x02020200, but 0x02020300 (the least significant byte overflows into the second least significant one).
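
If you do need per-byte wrapping while staying in an integer register, a well-known SWAR workaround (also given in Brendan's comment below) masks out the top bit of every byte so carries can't cross byte boundaries. A sketch:

 ; result = ((x & 0x7F7F...) + 0x0101...) ^ (x & 0x8080...)
 mov   rdx, 0x8080808080808080
 and   rdx, rax           ; rdx = the high bit of every byte of x
 mov   rcx, 0x7F7F7F7F7F7F7F7F
 and   rax, rcx           ; clear each byte's high bit
 mov   rcx, 0x0101010101010101
 add   rax, rcx           ; carries now stay within each byte
 xor   rax, rdx           ; recombine with the original high bits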


Option 2 (loop without carry propagation)

Since you indicated that you want to use a loop to solve your problem, a possible approach which also only takes two registers is this:

[global func]
func:
    mov rax, 0x4141414141414141

    mov rcx, 8
.func_loop:             ; NASM local .label is good style within a function
    inc al              ; modify low byte of RAX without affecting others
    rol rax, 8
    dec rcx
    jne .func_loop
    ; RAX has been rotated 8 times, back to its original layout

    ret

This will increment the least significant byte of rax (without affecting other bits of rax), then rotate rax by 8 bits to the left, and repeat.
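
Traced through the first iteration (starting from 0x4141414141414141), you can see how the rotate brings the next byte into AL each time:

    ; start:        rax = 0x4141414141414141
    ; inc al:       rax = 0x4141414141414142
    ; rol rax, 8:   rax = 0x4141414141414241
    ; ...after 8 rounds every byte has passed through AL once,
    ; and the total rotation of 64 bits restores the byte order.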

You could rotate by 16 bits (4 times) and do

inc ah           ; doing AH first happens to be better with Skylake's partial-register handling: inc al can run in parallel with this once AH is already renamed separately.
inc al
rol rax, 16

as the loop body. Modifying AH is usually worse for partial-register slowdowns than just modifying AL, although halving the iteration count should reduce overhead on CPUs like Ryzen that don't rename AH separately from RAX. (Fun fact: on Skylake this breaks even for latency, while inc al ; inc ah in that order is slower, because the inc ah can't start until after inc al: modern Intel CPUs don't rename the low-8 partial registers separately from the full reg, only high-8.)
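
Put into the same loop structure, that body would look like this (a sketch; 4 iterations of a 16-bit rotate cover all 8 bytes):

    mov rcx, 4
.func_loop16:           ; illustrative label, same pattern as above
    inc ah
    inc al
    rol rax, 16
    dec rcx
    jne .func_loop16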

Note that the loop instruction is slow on Intel CPUs; it is functionally equivalent to this (except that loop does not modify flags):

dec rcx
jne func_loop

Also note that doing add al, 1 might actually be slightly faster than inc al on certain systems, as discussed here.

(Editor's note: rol with a count other than 1 only needs to write CF, and inc/dec only write the other flags (SPAZO). So with good partial-flag renaming, inc / rol / dec won't couple the inc/rol dependency chain into the dec loop-counter dependency chain and make this any slower than it needs to be. (Tested on Skylake: it does in fact run at 2 cycles / iteration throughput for large loop counts.) But dec would be a problem on Silvermont, where inc/dec do merge into FLAGS. Making one of them a sub or add would break the dependency chain through FLAGS.)
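
For example, a sketch of such a variant, using add al, 1 (which writes all flags) in place of inc al:

.func_loop:
    add al, 1           ; writes all flags: no partial-flag merge, and per the
                        ; note above it can be faster than inc al anyway
    rol rax, 8
    dec rcx
    jne .func_loop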


Option 3 (SIMD add without carry propagation)

Probably the most efficient way to achieve this overflow behavior is using the dedicated SSE2 SIMD instruction:

default rel        ; use RIP-relative addressing by default

section .rodata
align 16           ; without AVX, 16-byte memory operands must be aligned
vec1:  times 8 db 0x01
               dq 0

section .text
[global func]
func:
    mov    rax, 0x4141414141414141

    movq   xmm0, rax
    paddb  xmm0, [vec1]      ; packed-integer add of byte elements
    movq   rax, xmm0

    ret

This will move the value of rax to the lower part of xmm0, perform a byte-wise addition of the predefined constant (which is 128 bits long, but the upper 64 bits are irrelevant to us and thus zero) and then write the result back to rax again.

The output is as expected: rax = 0x01010101010101FF yields 0x0202020202020200 (the least significant byte overflows).

Note that using a constant from memory would also be possible with integer add, instead of mov-immediate.
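
That is, the single wide add from option 1 could take its constant straight from memory (a sketch with a hypothetical one64 label; carries between bytes still propagate here):

section .rodata
one64:  times 8 db 0x01      ; eight 0x01 bytes = 0x0101010101010101

section .text
    add    rax, [one64]      ; 64-bit add with a memory source operand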

MMX would allow using only an 8-byte memory operand, but then you'd need EMMS before returning; the x86-64 System V ABI specifies that the FPU should be in x87 mode on call/ret.
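
A sketch of that MMX variant (with a hypothetical vec1q constant; note the emms before ret):

section .rodata
vec1q:  times 8 db 0x01      ; 8-byte constant; MMX has no alignment requirement

section .text
    movq   mm0, rax
    paddb  mm0, [vec1q]      ; packed byte add with only an 8-byte memory operand
    movq   rax, mm0
    emms                     ; put the FPU back in x87 mode before returning
    ret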


A trick you can use instead of loading a constant from memory is to generate it on the fly. It's efficient to generate an all-ones vector with pcmpeqd xmm1, xmm1. But how to use that to add 1? SIMD right shift is only available with word (16-bit) or larger elements so it would take a couple instructions to transform that to a vector of 0x0101.... Or SSSE3 pabsb.
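
With SSSE3, for example, that transform is a single extra instruction (a sketch):

    pcmpeqd  xmm1, xmm1      ; all-ones: every byte = 0xFF = -1
    pabsb    xmm1, xmm1      ; SSSE3: |-1| = 1 in every byte
    paddb    xmm0, xmm1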

But there's an even simpler trick that needs only SSE2: adding 1 is the same as subtracting -1, and all-ones is two's complement -1.

    movq     xmm0, rax
    pcmpeqd  xmm1, xmm1        ; set1( -1 )
    psubb    xmm0, xmm1        ; packed-integer sub of (-1) byte elements
    movq     rax, xmm0

Note that SSE2 also has instructions for saturating add and subtract, with paddsb or psubsb for signed saturation and paddusb or psubusb for unsigned. (For unsigned saturation you can't use the subtract -1 trick; that would always saturate to 0 instead of wrapping back to 1 above the original value.)
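
For example, to clamp each byte at 0xFF instead of wrapping, reusing the vec1 constant from above (a sketch):

    movq     xmm0, rax
    paddusb  xmm0, [vec1]    ; unsigned saturating add: 0xFF + 1 stays 0xFF
    movq     rax, xmm0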

Peter Cordes
janw
    RAX is 8 bytes wide; you'd want `mov rdx, 0x0101010101010101` / `add rax, rdx` because only `mov` allows a 64-bit immediate. (And unlike AArch64, it doesn't have compact encodings for repeated bit patterns.) You should at least mention that this doesn't stop overflow between bytes; for that you'd need to copy it to XMM0 and use `paddb`. – Peter Cordes Aug 25 '19 at 20:18
    Or rotate RAX a byte at a time, doing `add al, 1`. (And BTW, don't use [the slow `loop` instruction](https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently)). (I forget if `rol` is weird with partial flags; `inc` might be a performance problem for partial-flag reasons. But if we care about performance, `paddb` is obviously best) – Peter Cordes Aug 25 '19 at 20:21
    You don't need to leave "Edit" marks in your answer, that's what the edit history is for. You should replace old stuff instead of just adding new stuff. The only point in looping is doing 1 byte at a time so I'd just keep the ROL version and drop the `shl`/`loop` version. Especially since it still only does 4 bytes. – Peter Cordes Aug 25 '19 at 21:07
  • Thank you for those comments, I did not consider the overflow behavior. I have added examples for both proposed methods (rotate and xmm). Did not know of the partial flag stalls yet, thanks for the hint! I also added a link to a matching SO discussion. – janw Aug 25 '19 at 21:12
  • Oh, did not update the page to see your latest comment. Alright, then I will drop the "edit" mark and restructure the answer a bit. – janw Aug 25 '19 at 21:14
  • @PeterCordes I thought I remembered that AArch64 *removes* the complex immediate construction of 32-bit ARM. I had a look at the ARM manual and I didn't see a compact encoding for repeated bit patterns. – EOF Aug 25 '19 at 21:17
    @EOF: Oh right, AArch64 has repeat-bit-pattern immediates only for bitwise booleans like AND and ORR. https://godbolt.org/z/KA5IY1. Notice it can `orr x0, x0, 72340172838076673` in one instruction, but for `add` it has to generate that constant in a 64-bit register first. (Using `orr x1, xzr, 0x0101...` with the zero register.) https://dinfuehr.github.io/blog/encoding-of-immediate-values-on-aarch64/ includes details: the pattern is a contiguous range of set bits (start/len) inside a 2, 4, 8, 16, 32, or 64-bit element that repeats to fill the register. – Peter Cordes Aug 25 '19 at 23:20
    @JanWichelmann: I made some more tweaks to your answer to simplify some wording, and add some fun stuff :) – Peter Cordes Aug 26 '19 at 00:21
    For the carry propagation there's another way: `result = ((x & 0x7F7F7F7F7F7F7F7F) + 0x0101010101010101) ^ (x & 0x8080808080808080)`. – Brendan Aug 26 '19 at 00:30
  • I also tried `inc ah`/`inc al`/`rol rax,16` in a dec/jnz loop. On Skylake that runs at the same overall speed as the simpler `inc al`/`rol rax,8` loop, i.e. 4 cycles per iteration instead of 2, including the AH-merging uop. Interestingly, reordering to do `inc al`/`inc ah`/`rol rax,16` makes it slower, 5 cycles. Intel's optimization manual mentions a front-end stall effect when inserting an AH merging uop for Sandybridge, like it has to issue by itself, and this might be a consequence of that. But anyway, fewer `uops_executed` and `uops_issue` counts, although we could get that by unrolling – Peter Cordes Aug 26 '19 at 00:42
  • @PeterCordes: Thank you for that additional information! Looks like I'm also learning a lot here ;) I have done another slight edit to add some more structure, since the answer got a bit convoluted with all those different approaches (I also got a downvote for some reason). – janw Aug 26 '19 at 08:28
  • IDK why someone would downvote this. It has the simple answer the OP was looking for early on where they can find it, and it has lots of interesting technical detail. Sometimes I think people downvote for answering "beginner" questions with any care about performance, but that's silly IMO. Performance is one of the major reasons for knowing anything about asm. As long as there's still an answer in there for beginners, it's fine IMO. – Peter Cordes Aug 26 '19 at 08:44