x86-64 alignment of data in a loop that reads 8 chars at a time?

Question

strlen:
   xor r8,r8

.Lalignlong:
    test rdi, 0xf
    je .LfindNull
    prefetch [rdi + 8]
    cmp  Byte PTR [rdi], 0
    je  .LansNoAdd
    inc r8
    inc rdi
    jmp .Lalignlong

# do-while is faster than while because of fewer jumps (Agner)
.LfindNull:
    mov  r9, 0xFEFEFEFEFEFEFEFF
    mov  r10, 0x8080808080808080 # citation: Bit Twiddling Hacks Sean Eron Anderson
    prefetch [rdi + 192]
    mov rcx, [rdi]
    lea    rax, [rcx + r9]
    not     rcx
    and     rcx, rax
    and     rcx, r10
    jne .Lanswer
    nop # no idea why this makes it 2 cycles faster. findloop changes from 4a -> 4b
.Lfindloop:
    prefetch [rdi + 420]
    mov rcx, [rdi + 8]
    add rdi, 8
    add r8, 8
    lea     rax, [rcx + r9]
    not     rcx
    and     rcx, rax
    and     rcx, r10
    je .Lfindloop
.Lanswer:
    bsf     rcx, rcx
    shr     rcx, 3
    lea rax, [rcx + r8]
    ret
.LansNoAdd:
    mov rax, r8
    ret

This should be x86-64 assembly code for counting the length of a char string; the address of the string is passed in RDI.

I don't understand the first .Lalignlong part; does that do the data alignment?

If so, how is it supposed to work? The line test rdi, 0xf especially confuses me.

`test rdi, 0xf` followed by `je .LfindNull` jumps to `.LfindNull` if all 4 least significant bits of `rdi` are clear, i.e. if `rdi` is a multiple of 16. — Michael, May 12 '20 at 06:52
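The mask test described in that comment can be sketched in C. This is only an illustration: `is_aligned` and `bytes_until_aligned` are hypothetical helper names, not anything from the question's code, and the second function assumes no terminating null byte is hit before the pointer becomes aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if addr is a multiple of align (align must be a power of two).
   With align == 16 this is the same check as `test rdi, 0xf` / `je`:
   the low 4 bits of the address must all be zero. */
static int is_aligned(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* Number of one-byte steps a loop like .Lalignlong would take before the
   mask test passes, assuming it does not find a null byte first. */
static unsigned bytes_until_aligned(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (unsigned)((align - (addr & (align - 1))) & (align - 1));
}
```

For example, an address ending in `0x3` needs 13 single-byte steps to reach the next multiple of 16, while an address ending in `0x0` needs none.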
Is that because the loop runs faster on x86 if the memory address is a multiple of 16? — buku juku, May 12 '20 at 06:57
Yes, aligned loads can never be split across a 64-byte boundary between two cache lines, or worse a boundary between 4k pages. This also means that you avoid the risk of a page fault if the end of the string is close to the end of a page. [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739). It's aligning the *pointer*, given that the input `char *` might *not* have been aligned by 8. i.e. looping over the array in aligned 8-byte chunks instead of possibly misaligned 8-byte chunks. (They only need `test dil, 7` not 0xf) — Peter Cordes, May 12 '20 at 13:22
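The boundary argument in that comment comes down to address arithmetic, which a small C sketch can check. `crosses_boundary` is a hypothetical helper written for this illustration; it only computes which block the first and last byte of an access fall into and does not touch memory.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if a `size`-byte access starting at addr touches bytes on both
   sides of a multiple of `boundary` (boundary must be a power of two,
   e.g. 64 for a cache line or 4096 for a page). */
static int crosses_boundary(uintptr_t addr, uintptr_t size, uintptr_t boundary)
{
    assert(size != 0 && boundary != 0 && (boundary & (boundary - 1)) == 0);
    uintptr_t first = addr & ~(boundary - 1);              /* block of first byte */
    uintptr_t last  = (addr + size - 1) & ~(boundary - 1); /* block of last byte  */
    return first != last;
}
```

An 8-byte load at an 8-byte-aligned address such as `0x1ff8` ends at `0x1fff` and stays inside its page, whereas the same load at `0x1ffd` would spill into the next page.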
BTW, this has some minor inefficiencies (like an extra `add` inside the loop instead of doing a pointer subtract after), but it's somewhat silly to use this bithack algorithm at all on most x86-64 systems. x86-64 guarantees that SSE2 is available, so you can check 16 bytes at a time, like this simple SSE2 version: [Why is this code 6.5x slower with optimizations enabled?](https://stackoverflow.com/a/55589634). See also [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/a/57676035) — Peter Cordes, May 12 '20 at 13:26
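For readers who want to see the bithack outside of assembly, here is a rough C sketch of the same idea: step byte by byte until the pointer is 8-byte aligned, then test whole 8-byte words for a zero byte. It is not a translation of the question's code. The names `ONES`, `HIGHS`, `has_zero_byte` and `strlen_words` are made up for this example, it aligns to 8 rather than 16, and it computes the length with one pointer subtraction at the end instead of keeping a separate counter. Note that the word load can read bytes past the terminator, which ISO C does not permit; the sketch assumes the x86 behaviour discussed above, where an aligned load within the same page cannot fault.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Nonzero if any of the 8 bytes in v is zero. Adding 0xFEFEFEFEFEFEFEFF,
   as the assembly does with lea, is the same as subtracting ONES. */
static int has_zero_byte(uint64_t v)
{
    return ((v - ONES) & ~v & HIGHS) != 0;
}

static size_t strlen_words(const char *s)
{
    assert(s != NULL);
    const char *p = s;

    /* Byte-at-a-time until p is a multiple of 8. */
    while (((uintptr_t)p & 7) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Aligned 8-byte chunks; memcpy avoids a type-punned pointer load. */
    for (;;) {
        uint64_t v;
        memcpy(&v, p, sizeof v);
        if (has_zero_byte(v))
            break;
        p += 8;
    }

    /* The terminator is somewhere in this word; find the exact byte. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The final byte scan stands in for the `bsf` / `shr 3` pair in the assembly; it is slower but does not depend on byte order or on a compiler builtin.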
