The sse tag wiki has a couple guides to vectorization, including these slides from a talk which are comprehensible on their own which have some great examples of transforming your data structures to enable vectorization (and classic pitfalls like putting [x,y,z,w]
geometry vectors into single SIMD vectors).
The classic use-case for SIMD is when there are a lot of independent operations, i.e. no serial dependencies inside the loop, like z[i] = d * x[i] + y[i]
. Or if there are, then only with associative operations that let you reorder. (e.g. summing an array or a similar reduction).
Also important that you can do it without a lot of shuffling; ideally all your data lines up "vertically" in vectors after loading from contiguous memory.
Having many conditions that go different ways for adjacent elements is usually bad for SIMD. That requires a branchless implementation, so you have to do all the work of both sides of every branch, plus merging. Unless you can check that all 4 (or all 16 or whatever) elements in your vector go the same way
There are some things that can be vectorized, even though you might not have expected it, because they're exceptions to the usual rules of thumb. e.g. converting an IPv4 dotted-quad string into a 32-bit IPv4 address, or converting a string of decimal digits to an integer, i.e. implementing atoi()
. These are vectorized with clever use of multiple different tricks, including a lookup-table of shuffle-masks for PSHUFB with a vector-compare bitmap as an index for the LUT.
So once you know some tricks, you always rule out a vectorized implementation quickly just based on a few rules of thumb. Even serial dependencies can sometimes be worked around, like for SIMD prefix sums.