
So I wrote this strlen a while ago and everything seemed fine. But I started noticing bugs in my codebase, and after a while I tracked them down to this strlen function. I used SIMD instructions to write it, and since I'm new to writing intrinsics the code probably isn't the best it could be either.

Here is the function:

inline size_t strlen(const char* data) {
    const __m256i terminationCharacters = _mm256_setzero_si256();
    const size_t shiftAmount = ((size_t)data) & 31;
    const __m256i* pointer = (const __m256i*)(data - shiftAmount);

    size_t length = 0;

    for (;; length += 32, ++pointer) {
        const __m256i comparingData = _mm256_load_si256(pointer);
        const __m256i comparison = _mm256_cmpeq_epi8(comparingData, terminationCharacters);

        if (!_mm256_testc_si256(terminationCharacters, comparison)) {
            const auto mask = _mm256_movemask_epi8(comparison);

            return length + _tzcnt_u32(mask >> shiftAmount);
        }
    }
}
  • If *(data - 1) == '\0' and data is not aligned then I think your loop terminates immediately, but the string could be longer. – James Griffin Apr 22 '21 at 21:19
  • I think the simplest approach to fixing this would be to either just use unaligned loads (is this actually significantly slower?) or to perform a single unaligned load of the first block of 32 characters and test for the termination character, then in your loop move to aligned loads but start at the aligned position after data. – James Griffin Apr 22 '21 at 21:30
  • Is this for x86_64? – Noah Apr 23 '21 at 00:40
  • 2
    @JamesGriffin: You can't use unaligned loads unless you check for being near the end of a page. C `strlen` has to work correctly if you pass it a pointer to a 3-byte string that's only say 5 bytes from the end of a page, and the next page is unmapped. [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739). Handling all this correctly is part of the trick; see the hand-written asm in glibc's `strlen` for example. Doing a page-cross check before a first unaligned vector can work, then go aligned (overlap is fine) – Peter Cordes Apr 23 '21 at 03:11
  • What kind of bugs? The code is nice and short, but what symptoms are we looking for? Wrong results? Segfaults? An important part of a [minimal reproducible example] is a specific problem description, like a test-case where it fails, but if you don't have that you can at least say whether it's segfaulting in this code or not. (And if so, use a debugger to find the details.) Though since you're aligning the pointer (correctly, I think), segfaulting shouldn't be possible; I think you're also always finding a terminator if there is one. – Peter Cordes Apr 23 '21 at 03:12
  • Wait a minute, `_tzcnt_u32(mask >> shiftAmount);` looks buggy, too. It should be `_tzcnt_u32(mask) - shiftAmount;` for any vectors after the first, otherwise you might get `32` if you shift out the only match because it was near the start of an aligned vector. Separate your startup check from your aligned-pointer loop body. And in the startup, check `mask >> sh` for being non-zero. – Peter Cordes Apr 23 '21 at 03:18
  • @PeterCordes Thank you Peter, I overlooked that, good catch. – James Griffin Apr 25 '21 at 12:15

2 Answers


Your attempt to combine startup handling into the aligned-vector loop has at least 2 showstopper bugs:

  • You exit the loop if your aligned load finds any zero bytes, even if they're from before the proper start of the string. (@James Griffin spotted this in comments). You need to do mask >>= shiftAmount and check that for non-zero to see if there were any matches in the part of the load that comes after the start of the string. (Don't use _mm256_testc_si256, just movemask and check).

  • _tzcnt_u32(mask >> shiftAmount); is buggy for any vectors after the first. The whole vector comes from bytes after the start of the string, so you need tzcnt to see all of the bits. Instead, you want _tzcnt_u32(mask) - shiftAmount, I think. (A sketch combining both fixes follows this list.)
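
Here's a minimal sketch of that separation (the name strlen_avx2 is arbitrary, and this only illustrates the two fixes; it's not tuned or exhaustively tested). The startup shifts the first mask right, and the loop computes the length with a pointer-subtract instead of touching shiftAmount again:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

inline size_t strlen_avx2(const char* data) {
    const __m256i zeros = _mm256_setzero_si256();
    const size_t shiftAmount = ((size_t)data) & 31;
    const __m256i* pointer = (const __m256i*)(data - shiftAmount);

    // Startup: one aligned load that may include bytes from before the start
    // of the string. Shift those bytes out of the mask before testing it.
    __m256i v = _mm256_load_si256(pointer);
    uint32_t mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zeros));
    mask >>= shiftAmount;           // discard matches from before the string
    if (mask)
        return _tzcnt_u32(mask);    // bit index is already relative to data

    // Main loop: every load is aligned and entirely past the start of the
    // string, so the whole mask is valid and no shifting is needed.
    for (;;) {
        ++pointer;
        v = _mm256_load_si256(pointer);
        mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zeros));
        if (mask)   // pointer-subtract instead of a separate length counter
            return (size_t)((const char*)pointer - data) + _tzcnt_u32(mask);
    }
}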

Make yourself some test cases with 0 bytes before the actual string but inside the first aligned vector, and test cases with the final 0 in different places relative to a vector boundary, and check your version's results against libc strlen. (Maybe even randomize the 0-position within the first 32 bytes, and then within the next 64 bytes after that.)
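
For instance, a brute-force harness along these lines (names are mine; it assumes the corrected function is called strlen_avx2 as in the sketch above, and the buffer is oversized and backstopped so no read escapes it):

#include <assert.h>
#include <stddef.h>
#include <string.h>

void test_strlen_avx2() {
    alignas(32) char buf[128];
    for (size_t start = 0; start < 32; ++start) {        // misalignment of the string
        for (size_t len = 0; start + len < 96; ++len) {  // position of the final 0
            memset(buf, 'x', sizeof(buf) - 1);
            buf[sizeof(buf) - 1] = '\0';                 // backstop terminator
            if (start)                                   // stray 0 before the string,
                buf[start - 1] = '\0';                   // inside the first aligned vector
            buf[start + len] = '\0';
            assert(strlen_avx2(buf + start) == strlen(buf + start));
        }
    }
}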

Your strategy for handling unaligned startup should work, if you separate it from the loop. (Is it safe to read past the end of a buffer within the same page on x86 and x64?).

Another option is a page-cross check before a first unaligned vector load from the actual start of the string. (But then you need a fallback to something else). Then go aligned: overlap is fine; as long as you calculate the final length correctly, it doesn't matter if you check the same byte twice for being zero.
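
The page-cross check itself can be as simple as this sketch (assuming 4K pages, the smallest page size on x86-64; the helper name is made up):

#include <stdint.h>

// A 32-byte unaligned load from p can't fault if it doesn't extend into the
// next page; pages are 4K-aligned and at least 4K on x86-64.
inline bool page_cross_safe_32(const char* p) {
    return ((uintptr_t)p & 4095) <= 4096 - 32;
}

If it returns false, use a fallback for the first vector (the aligned-load + mask-shift startup above works), then switch to aligned loads.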


(You also don't really want the compiler to waste instructions inside the loop incrementing both a pointer and a separate length, so check the resulting asm. A pointer subtraction after the loop should do the trick, even if you have to cast to uintptr_t. Also, you can subtract the final zero-position from the initial function arg instead of from the aligned pointer; then, instead of subtracting shiftAmount twice, you're not using it at all except for the initial alignment. The sketch above does both.)

Don't use the vptest intrinsic (_mm256_testc_si256) at all, even in the main loop where you should be checking all the bytes; it's no better for checking _mm_cmp* results. vptest is 2 uops and can't macro-fuse with a branch instruction, but vpmovmskb eax, ymm0 is 1 uop, and test eax,eax / jz .loop is another single macro-fused uop. Even better, you actually need the integer movemask result after the loop, so you already have it.


Related

  • Is it safe to read past the end of a buffer within the same page on x86 and x64?

  • Why does glibc's strlen need to be so complicated to run quickly? (includes links to hand-written x86-64 asm for glibc's strlen implementation.) Unless you're on a platform with a worse C library, you should normally use that, because glibc uses CPU detection during dynamic linking to select a good version of strlen (and memcpy, etc.) for your CPU. Unaligned startup for strlen is somewhat tricky, and glibc, I think, makes reasonable choices, unless the function-call overhead is a big problem. It also has good loop-unrolling techniques for big strings (like _mm256_min_epu8 to get a zero in a vector element if either of 2 input vectors had a zero, so it can amortize the actual movemask/branch work over a whole cache line of data; a rough sketch of this trick appears at the end of this answer). It might be too aggressive in ramping up to that for medium-length strings, though.

    Note that glibc's licence is the LGPL, so you can't just copy code from glibc into your project unless your license is compatible. Even writing an intrinsics equivalent of its asm might be questionable.

  • Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? - a simple SSE2 strlen that doesn't handle misalignment, in hand-written asm. And comments on benchmarking.

  • https://agner.org/optimize/ - guides and instruction tables, and his subroutine library (in hand-written asm) includes a strlen. (But note it's GPL licensed.)

I assume some of the BSDs and MacOS have an asm strlen under a more permissive license you could use / look at if your project isn't GPL-compatible.
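
As referenced in the list above, here's a rough sketch of the _mm256_min_epu8 unrolling idea (my illustration, not glibc's actual code; it assumes the caller already handled the unaligned startup and passes a 32-byte-aligned pointer at or past the start of the string):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Check 64 bytes per movemask/branch: vpminub produces a 0 byte iff either
// input vector had one. `aligned` must be 32-byte aligned and point at or
// after the start of the string that begins at `data`.
inline size_t strlen_tail_unrolled(const char* data, const __m256i* aligned) {
    const __m256i zeros = _mm256_setzero_si256();
    for (;; aligned += 2) {
        const __m256i v0 = _mm256_load_si256(aligned);
        const __m256i v1 = _mm256_load_si256(aligned + 1);
        const __m256i both = _mm256_min_epu8(v0, v1);
        if (_mm256_movemask_epi8(_mm256_cmpeq_epi8(both, zeros))) {
            // A terminator is somewhere in these 64 bytes: sort out which half.
            uint32_t m0 = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v0, zeros));
            if (m0)
                return (size_t)((const char*)aligned - data) + _tzcnt_u32(m0);
            uint32_t m1 = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v1, zeros));
            return (size_t)((const char*)(aligned + 1) - data) + _tzcnt_u32(m1);
        }
    }
}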

Peter Cordes

No offense but

size_t strlen(char *p)
{
    size_t ret_val = 0;

    while (*p++) ret_val++;

    return ret_val;
}

has been doing its job very well since long ago. Also, today's optimizing compilers generate very tight code for it, and your code is impossible to read.

Luis Colorado
  • Unfortunately [today's optimizing compilers](https://godbolt.org/z/sx5sd47W1) produce very *naive* code for it, checking one byte at a time. – Nate Eldredge Apr 24 '21 at 19:44
  • Explain to me how a null byte can be found in memory without checking every byte between the pointer parameter and the first found `\0` character. If you have to look at them all, then there's no alternative other than using all cores to parallelize the operation, but not much more. – Luis Colorado Apr 24 '21 at 21:14
  • The entire point of the SIMD algorithms like the one OP is working on, and of the SIMD algorithms used in standard library implementations, is that you can check 32 bytes at a time with a couple of AVX instructions, still only using one core. That is dramatically faster than 32 `cmpb` in a loop. But compilers won't do this for you, likely because of the alignment games that are needed. – Nate Eldredge Apr 24 '21 at 21:17
  • The problem is that you cannot look _after_ the found '\0', so there's no way to search except byte by byte, in sequence; otherwise you can find yourself accessing the memory after the `\0` and getting a `SIGSEGV`. – Luis Colorado Apr 24 '21 at 21:19
  • But you can! That's the other essential trick here. At a low level, it *is* safe to load from memory that comes after the `\0`, provided that you don't cross a page boundary - segfaults only occur by touching unmapped pages, so just ensure you don't access any *pages* that you otherwise wouldn't. The alignment is what ensures that here. A 32-byte aligned load will never cross a page boundary, and we know that the *first* byte of those 32 is okay to touch (because the `\0` wasn't seen previously), so the entire 32 bytes is safe to load. – Nate Eldredge Apr 24 '21 at 21:22
  • Naturally it is undefined behavior at the level of C, but the point of these implementations is to take advantage of what we know about the behavior of our hardware, beyond what the C standard promises us. – Nate Eldredge Apr 24 '21 at 21:23
  • If you were implementing a `strnlen()` in which you specify a buffer size, then your approach would be valid... but imagine you have `char b[] = "";` and ask for `strlen()` with an algorithm that works in chunks of 64 bits. You'll access the seven bytes past the null, with _undefined behaviour_ on the table. – Luis Colorado Apr 24 '21 at 21:25
  • The behavior of doing so is undefined at the level of C, but it's perfectly well defined by the x86-64 architecture: if the access includes a not-present page there will be a page fault, otherwise you'll successfully read whatever happens to be in physical memory. So we just have to avoid the former, and we know that pages are 4K aligned. If, say, `b` is at address `0x1234fffe`, then OP's code will mask off low bits and load 32 bytes from addresses `0x1234ffe0 - 0x1234ffff`, all of which are on the same accessible page as `b` itself - no page fault. – Nate Eldredge Apr 24 '21 at 21:32
  • I agree it would not be safe to load 32 or 8 bytes from address `0x1234fffe`, but the code here is not going to do that. – Nate Eldredge Apr 24 '21 at 21:33
  • ............ :) – Luis Colorado Apr 24 '21 at 21:37
  • *If you have to look at them all, then there's no alternative other than using all cores to parallelize the operation* - nope, other cores also don't know whether they can safely look at memory past the end of the current page, so it's not worth doing any inter-core communication just for at most 4096 bytes of work to be divided up. And yes, Nate's right, the entire point of this exercise is [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739), using the same tricks that good hand-written asm `strlen` implementations use. – Peter Cordes Apr 25 '21 at 02:05
  • yes, you can divide your page into slices and assign each core one slice.... you have to wait for them to finish to see if they have found a null... and ignore the ones assigned slices at higher addresses until all of them have finished... if none finds it, go on to the next page.... but only inside the page... as you did. – Luis Colorado Apr 25 '21 at 19:42
  • Didn't see your reply earlier since you didn't @ notify me. (Was looking for something else when I saw this Q&A again). As I said, yes you *could* hand out work to another core for every page, but that's not much work for the amount of synchronization overhead. Best case for single core, the data is hot in L2 or L1d cache and one core can blaze through it before another core could even notice a store to a shared var. If cold in cache, at ~25GB/s it still only takes about 160 ns to read a 4k page, and inter-core latency is on the order of 20 to 50 nanoseconds (one-way I think) on a non-Xeon – Peter Cordes May 30 '21 at 02:25
  • https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5 has some Zen3 inter-core latency benchmarks getting as low as 15ns within one CCX, but 80 on another CCX. Highly unlikely to be worth parallelizing this way; if you care about strlen performance this much, take a buffer-size arg so the strlen knows how far it can read without risk of faulting. (And of course, keep track of string lengths so you don't need to scan for them in huge buffers, as much as possible, so huge-buffer strlen performance is even less relevant.) – Peter Cordes May 30 '21 at 02:28
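
To make the address math from this comment thread concrete, here's a worked version (assuming 4K pages as in the comments; the address is Nate's example):

#include <assert.h>
#include <stdint.h>

int main() {
    uintptr_t b = 0x1234fffe;                  // example address from the comments
    uintptr_t aligned = b & ~(uintptr_t)31;    // round down to a 32-byte boundary
    assert(aligned == 0x1234ffe0);
    // A 32-byte-aligned block never spans a 4K boundary, so the load covering
    // 0x1234ffe0..0x1234ffff stays on the same (accessible) page as b.
    assert((aligned >> 12) == ((aligned + 31) >> 12));
}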