
Are SIMD instructions built for vector numerical calculations only? Or do they lend themselves well to a class of string manipulation tasks, like writing rows of data to a text file where the order of the rows does not matter? If so, which APIs or libraries should I start with?

Peter Cordes
sprkv5
  • Short answer: I/O is so many orders of magnitude slower than the CPU that SIMD isn't going to help; the CPU can already run *much* faster than the I/O layer without SIMD help. – ShadowRanger Nov 23 '20 at 19:34
  • @ShadowRanger: Not true when you're reading text files that are hot in disk cache. Then it's basically memory bandwidth with some system call or page-fault overhead. At a couple cycles per byte for some scalar byte-at-a-time loop, a 4GHz CPU could only go 2GB/s. That's only about 2x faster than a modern consumer SSD, and some byte-at-a-time code can be much slower than that. And server I/O can be faster. And of course memory bandwidth is much higher. I/O is often slow, but not always *orders of magnitude*. Modern systems aren't always breathing through a straw. – Peter Cordes Nov 25 '20 at 06:04

3 Answers


Yes! And this is actually done in high-performance parsing libraries. One example: simdjson, a parser that can parse gigabytes of JSON per second. There's an "About simdjson" section in the readme, which links to a talk that goes over some of the implementation details.
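
To give a feel for the API, here is a minimal sketch along the lines of the quick-start in simdjson's readme (the file name is just a placeholder, and this assumes a build with exceptions enabled):

```cpp
#include "simdjson.h"
#include <cstdint>
#include <iostream>

int main() {
    simdjson::ondemand::parser parser;
    // Placeholder file name; padded_string::load adds the extra padding
    // that simdjson's SIMD routines need to read safely past the end.
    simdjson::padded_string json = simdjson::padded_string::load("twitter.json");
    simdjson::ondemand::document doc = parser.iterate(json);
    // The on-demand API only fully parses the fields you actually touch.
    std::cout << uint64_t(doc["search_metadata"]["count"]) << " results\n";
}
```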

SIMD instructions operate on numeric values, but once you're at that level, "text" is just numeric values too: UTF-8 code units are just unsigned 8-bit integers, with plenty of SIMD support. Processing bitmaps is full of operations on many 8-bit unsigned integers in parallel, and conveniently this is so common that SIMD instruction sets cover those operations well, so plenty of them are also usable for text processing.
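
For a concrete taste, here is a minimal sketch (my own example, not from any particular library) that uses SSE2 intrinsics to test whether a buffer is pure ASCII, treating the text as 16 unsigned bytes at a time:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Returns true if every byte is < 0x80, i.e. pure ASCII.
// Sketch only: assumes len is a multiple of 16; real code would
// handle the tail bytes with a scalar loop.
bool is_ascii_sse2(const unsigned char* p, std::size_t len) {
    __m128i acc = _mm_setzero_si128();
    for (std::size_t i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        acc = _mm_or_si128(acc, v);  // OR all chunks together; high bits stick
    }
    // pmovmskb collects the high bit of each byte: nonzero => non-ASCII seen.
    return _mm_movemask_epi8(acc) == 0;
}
```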

I/O is so many orders of magnitude slower than the CPU

Not really. It is slower, but when the CPU has to do things that kill streaming performance, such as branch mispredictions, cache misses, or wasting lots of speculative-execution resources on dead ends, it can easily fail to keep up with I/O. Modern network cards used for fast storage access or multi-machine communication can saturate the CPU's memory ports. All of them. And keep them that way. But that's state of the art and quite expensive at the moment (bonded 50 GBit links and such). Sequential, byte-at-a-time parser code is way slower than that.

Kuba hasn't forgotten Monica
  • Naively I would say SIMD would not help since for long strings memory bandwidth would be the bottleneck. Why is this not the case? – doron Nov 23 '20 at 19:56
  • I see your point - treating everything like hex or binary or integers. So I can basically use SIMD if I know what to stuff into those wide registers. – sprkv5 Nov 23 '20 at 19:56
  • Also, I'm on a system with Gigabit connectivity and the data pull of 50 million records saturates 32GB of memory pretty fast. As I mentioned in a comment: My goal is to 1. convert data pulled from DB to CSV rows and then write to a file as fast as possible, 2. keep the memory from running out (reuse). – sprkv5 Nov 23 '20 at 20:02
  • Read about simdjson - their paper or their video, both are pretty good. That will give you an idea how to go about it. I'm pretty sure that you'll be able to keep up if you have NVMe storage - even on a 5 year old laptop, assuming that the CSV format won't grow the data much compared to what came in over the wire (otherwise you'll need a RAID). The CPU should very easily keep up with all that. – Kuba hasn't forgotten Monica Nov 23 '20 at 20:42
  • @doron Memory bandwidth is something like a cacheline every couple of clock cycles (order of magnitude), so very fast code can saturate it, but even simdjson can't keep up with memory - a couple gigabytes/s is far below what memory can do. Widest SIMD operations will let you do one or two operations on an entire cacheline per clock cycle, but you'll still need quite a few of those operations to carry out your task. – Kuba hasn't forgotten Monica Nov 23 '20 at 20:46
  • @doron: Even if data has to come from DRAM, modern desktop / laptop CPUs can load about 8 bytes per core clock cycle, within a factor of 2 of that anyway. Good luck keeping up with that with byte-at-a-time scalar loops. Besides, if you just did a `read()` system call to get the kernel to memcpy some data from a network buffer or pagecache into your process's memory, the data might still be hot in L3 cache, or even L2. Xeon CPUs can even DMA into L3 cache, or something like that, so aiming for memory bandwidth is a pretty low / unambitious goal, and a poor excuse for not optimizing. – Peter Cordes Nov 24 '20 at 02:32
  • Besides, fewer instructions to process the same data lets OoO exec see farther ahead, and start demand-loads for later pages / cache lines earlier in cases where HW prefetch wouldn't (e.g. across page boundaries.) It can also be more hyperthreading-friendly, leaving the HT sibling core with better throughput if anything's running on it. (Maybe nothing if there aren't a lot of threads active). Also, if SIMD is efficient enough, it may save power: tracking instructions through the pipeline is a large part of the cost, not the execution units themselves. – Peter Cordes Nov 24 '20 at 02:36

Yes, especially for ASCII, e.g. Convert a String In C++ To Upper Case. Or checking for valid UTF-8 (https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/), or checking if a string happens to be the ASCII subset of UTF-8. (If so, you know you have fixed-width characters, which is very useful for other things.)
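
For a concrete flavor of the case-conversion trick, here's a minimal SSE2 sketch (my simplification, not the code from that Q&A; it assumes the length is a multiple of 16 and skips tail handling):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Uppercase ASCII letters in place, 16 bytes per step.
// Sketch only: assumes len is a multiple of 16; no tail handling.
void ascii_toupper_sse2(char* s, std::size_t len) {
    const __m128i below_a  = _mm_set1_epi8('a' - 1);
    const __m128i above_z  = _mm_set1_epi8('z' + 1);
    const __m128i case_bit = _mm_set1_epi8(0x20);  // 'a' - 'A'
    for (std::size_t i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<__m128i*>(s + i));
        // Per-byte range check 'a' <= v <= 'z'. Signed compares are OK:
        // bytes >= 0x80 are negative, so they fail the > 'a'-1 test.
        __m128i is_lower = _mm_and_si128(_mm_cmpgt_epi8(v, below_a),
                                         _mm_cmplt_epi8(v, above_z));
        // Subtract 0x20 only from the lowercase letters.
        v = _mm_sub_epi8(v, _mm_and_si128(is_lower, case_bit));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(s + i), v);
    }
}
```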

As Daniel Lemire reported, an early attempt at UTF-8 validation cost "a few CPU cycles per character." With SIMD, he and his collaborators got that down to ~1 instruction per byte, for net speeds of ~12GB/s (vs. ~25GB/s DRAM bandwidth for a Haswell desktop, or ~34GB/s for Skylake with DDR4-2133).


Of course, most C libraries already have hand-written asm implementations of functions like strlen, strcpy, strcasecmp, strstr, etc. that use SIMD if it's a win (like on x86-64 where pmovmskb allows relatively efficient compare/branch on any/all SIMD compare results being true or false.) The first part of my answer on Why does glibc's strlen need to be so complicated to run quickly? has some links to hand-optimized asm that glibc actually uses on mainstream platforms, instead of the portable plain C fallback the question is asking about.
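
As a simplified illustration of that pmovmskb pattern (not glibc's actual code, which handles an unaligned start and unrolls much more aggressively):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// strlen sketch: compare 16 bytes at a time against 0, then use the
// pmovmskb bitmap to find the first zero byte. Assumes s is 16-byte
// aligned so a load can't cross into an unmapped page; glibc's real
// version handles unaligned starts. __builtin_ctz is GCC/Clang-specific.
std::size_t strlen_sse2(const char* s) {
    const __m128i zero = _mm_setzero_si128();
    for (std::size_t i = 0; ; i += 16) {
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)                       // some byte in this chunk was 0
            return i + __builtin_ctz(mask);  // offset of the first 0 byte
    }
}
```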

https://github.com/WojciechMula/sse4-strstr has a variety of strstr implementations. Substring searching is a much harder problem, with non-trivial algorithm choices as well as just brute-force. The SSE4.2 "string" instructions may help for that, but if not then SIMD vector compares definitely can for better brute-force building blocks.

(SSE4.2 "string" instructions like pcmpistri are definitely worse for memcmp / strcmp and strlen where plain SSE2 (or AVX2) is better. See How much faster are SSE4.2 string instructions than SSE2 for memcmp? and https://www.strchr.com/strcmp_and_strlen_using_sse_4.2)

You can even do cool tricks with looking up a shuffle control vector based on a vector compare bitmap, e.g. Fastest way to get IPv4 address from string or How to implement atoi using SIMD?. Although I'm not sure the SIMD atoi is a win vs. scalar, especially for short numbers.


Naively I would say SIMD would not help since for long strings memory bandwidth would be the bottleneck. Why is this not the case?

DRAM bandwidth is really pretty good compared to modern CPU speeds, especially when the data comes in byte chunks, not 8-byte double chunks. And data is often hot in L3 cache after copying (e.g. from a read system call).

Even if data has to come from DRAM, modern desktop / laptop CPUs can load about 8 bytes per core clock cycle, within a factor of 2 of that anyway, especially if this core isn't competing with other bandwidth-intensive code on other cores. Good luck keeping up with that with byte-at-a-time scalar loops.

Besides, if you just did a read() system call to get the kernel to memcpy some data from a network buffer or pagecache into your process's memory, the data might still be hot in L3 cache, or even L2. Xeon CPUs can even DMA into L3 cache, or something like that. Aiming for memory bandwidth is a pretty low / unambitious goal, and a poor excuse for not fully optimizing a function if it actually gets use a lot.

Fewer instructions to process the same data lets out-of-order exec "see" farther ahead, and start demand-loads for later pages / cache lines earlier in cases where HW prefetch wouldn't (e.g. across page boundaries). And also better overlap the string processing with earlier / later independent work.

It can also be more hyperthreading-friendly, leaving the HT sibling core with better throughput if anything's running on it. (Maybe nothing if there aren't a lot of threads active). Also, if SIMD is efficient enough, it may save energy: tracking instructions through the pipeline is a large part of the cost, not the integer execution units themselves. Higher power while running, but finishing sooner, is good: race to sleep. CPUs save much more power when fully idle than when just running "cheap" instructions.

Peter Cordes
  • DRAM bandwidth is not disastrously bad for client processors with 2 or maybe 4 cores, but it is much less adequate in the server space, as I have documented for the last 29 years: https://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/ – John D McCalpin Nov 25 '20 at 17:43

SIMD instructions are used at a very low level. Writing data to a text file happens at a much higher level, involving buffered I/O etc.

You might use SIMD, e.g., to convert a string from lower case to upper case. Wrapping SIMD into a library would be moot: you write the instructions yourself, which also means they are processor-specific (e.g. SSE variants on x86/AMD64).

For processing several rows of text in parallel, you might use micro-parallelization instead, e.g. offered by OpenMP or TBB.
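
A minimal sketch of that idea for the asker's CSV use case, with a hypothetical Record type and to_csv helper (OpenMP splits the independent per-row work across threads; compile with -fopenmp on GCC/Clang):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type and CSV conversion, purely for illustration.
struct Record { int id; double value; };

static std::string to_csv(const Record& r) {
    return std::to_string(r.id) + "," + std::to_string(r.value) + "\n";
}

// Each row is independent, so OpenMP can hand chunks of the loop to
// different threads; since row order doesn't matter here, threads
// could even write their chunks out directly.
std::vector<std::string> format_rows(const std::vector<Record>& rows) {
    std::vector<std::string> out(rows.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0;
         i < static_cast<std::ptrdiff_t>(rows.size()); ++i)
        out[i] = to_csv(rows[i]);
    return out;
}
```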

However, if we stick to the example of writing to a text file, we get into a different territory of performance optimization (I/O instead of computation).

ypnos