I'm curious about SIMD and wondering if it can handle this use case.

Let's say I have an array of 2048 integers, like [0x018A, 0x004B, 0x01C0, 0x0234, 0x0098, 0x0343, 0x0222, 0x0301, 0x0398, 0x0087, 0x0167, 0x0389, 0x03F2, 0x0034, 0x0345, ...]

Note how they all start with either 0x00, 0x01, 0x02, or 0x03. I want to split them into 4 arrays:

  • One for all the integers starting with 0x00
  • One for all the integers starting with 0x01
  • One for all the integers starting with 0x02
  • One for all the integers starting with 0x03

I imagine I would have code like this:

#include <stdint.h>

int main() {
    uint16_t in[2048] = ...;

    // 4 arrays, one for each category
    uint16_t out[4][2048];

    // Pointers to the next available slot in each of the arrays
    uint16_t *nextOut[4] = { out[0], out[1], out[2], out[3] };

    for (uint16_t *nextIn = in; nextIn < in + 2048; nextIn += 4) {

        (*** magic simd instructions here ***)

        // Equivalent non-simd code:
        uint16_t categories[4];
        for (int i = 0; i < 4; i++) {
            categories[i] = nextIn[i] >> 8;  // high byte is the category, 0-3
        }
        for (int i = 0; i < 4; i++) {
            uint16_t category = categories[i];
            *nextOut[category] = nextIn[i];
            nextOut[category]++;
        }
    }
    // Now I have my categorized arrays!
}

I imagine that my first inner loop doesn't need SIMD; it can be just an (x & 0xFF00FF00FF00FF00) mask, or a vector right-shift by 8. But I wonder whether we can turn that second inner loop into a SIMD instruction.
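For that first step, I imagine something like this untested SSE2 sketch (the right-shift leaves each element's category, 0-3, in its 16-bit lane):

#include <emmintrin.h>  // SSE2

// Load 8 uint16_t elements and extract each one's category.
__m128i v          = _mm_loadu_si128((const __m128i *)nextIn);
__m128i categories = _mm_srli_epi16(v, 8);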

Is there any sort of SIMD instruction for this "categorizing" action that I'm doing?

The "insert" instructions seem somewhat promising, but I'm a bit too green to understand the descriptions at https://software.intel.com/en-us/node/695331.

If not, does anything come close?

Thanks!

Verdagon
  • Scattered stores - you'll need AVX-512 for that. It probably won't be super-efficient either. – Paul R Sep 18 '18 at 16:54
  • Thanks for the lead, that's super interesting! After some reading, it looks like scattered stores can store a bunch of numbers to a bunch of corresponding pointers. How would I map from these numbers (0x00-0x03) to pointers? Also, why wouldn't it be efficient? – Verdagon Sep 18 '18 at 17:08
  • Because scatter stores will still bottleneck on cache misses, or if not, are still limited to 1 element per clock like regular stores. Scatter instructions also decode to a lot of uops (more than gather loads), so they cost front-end throughput. You also have to detect conflicts (when multiple elements in a vector will go to the same bucket, you need to write them to sequential addresses, not stepping on each other, and you have to increment the per-bucket position counters by the right amounts). – Peter Cordes Sep 18 '18 at 20:49
  • So that's a lot of extra work that a SIMD version has to do compared to doing 1 element at a time. Maybe with efficient `vpconflictd` (e.g. on KNL but not Skylake-avx512) you can come out ahead. This is similar to a histogram problem (where you're incrementing an array of per-bucket counters), but harder because you have to actually still preserve each element. – Peter Cordes Sep 18 '18 at 20:49
  • Thanks for all this information, I super appreciate it! I was also wondering about this conflict situation, sounds like vpconflictd is the closest we can get. I'll do some reading up on the histogram problem, thanks for the lead! – Verdagon Sep 18 '18 at 23:29

1 Answer


You can do it with SIMD, but how fast it is will depend on exactly what instruction sets you have available, and how clever you are in your implementation.

One approach is to take the array and "sift" it to separate out elements that belong in different buckets. For example, grab 32 bytes from your array, which gives you 16 16-bit elements. Use some cmpgt instructions to get a mask that determines whether each element falls into the 00 + 01 bucket or the 02 + 03 bucket. Then use some kind of "compress" or "filter" operation to move all masked elements contiguously into one end of a register, and then do the same for the unmasked elements.

Then repeat this one more time to sort out 00 from 01 and 02 from 03.
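For concreteness, here is an untested sketch of one sift step on 128-bit vectors (SSE, 8 elements at a time). The `pack_lut` table is hypothetical: entry m would hold a pshufb control that moves the elements whose bits are set in m to the bottom of the vector, and `__builtin_popcount` assumes GCC/Clang. Both output buffers need 16 bytes of slack, because a full vector is stored and the pointer is advanced only by the element count:

#include <tmmintrin.h>  // SSSE3 for _mm_shuffle_epi8
#include <stdint.h>

extern const uint8_t pack_lut[256][16];  // hypothetical precomputed shuffle controls

// One sift step: split 8 uint16_t elements into a "low" stream
// (categories 0x00/0x01) and a "high" stream (0x02/0x03).
static inline void sift8(const uint16_t *src, uint16_t **lo, uint16_t **hi)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)src);
    // 0xFFFF in lanes whose category is >= 0x02 (all values fit in
    // 0..0x03FF, so the signed compare is safe)
    __m128i gt = _mm_cmpgt_epi16(v, _mm_set1_epi16(0x01FF));
    // Collapse the per-lane masks down to one bit per element
    unsigned m = _mm_movemask_epi8(_mm_packs_epi16(gt, _mm_setzero_si128())) & 0xFF;
    int nhi = __builtin_popcount(m);  // elements bound for the high stream

    // Left-pack each group; over-store a full vector, and let the
    // pointer bump commit only the right number of elements.
    __m128i packLo = _mm_shuffle_epi8(v, _mm_loadu_si128((const __m128i *)pack_lut[~m & 0xFF]));
    __m128i packHi = _mm_shuffle_epi8(v, _mm_loadu_si128((const __m128i *)pack_lut[m]));
    _mm_storeu_si128((__m128i *)*lo, packLo);  *lo += 8 - nhi;
    _mm_storeu_si128((__m128i *)*hi, packHi);  *hi += nhi;
}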

With AVX2 you could start with this question for inspiration on the "compress" operation. With AVX512 you could use the vcompress instruction to help out: it does exactly this operation, but only at 32-bit or 64-bit granularity, so you'd need at least a couple per vector.
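As an illustration of that widening dance, here is an untested AVX-512F sketch that extracts one category per pass (the destination needs 32 bytes of slack for the over-store, and `compress_cat0` is a made-up name):

#include <immintrin.h>
#include <stdint.h>

// Extract the elements of one category (here 0x00) from 16 uint16_t
// values, append them to dst, and return the advanced pointer.
static inline uint16_t *compress_cat0(const uint16_t *src, uint16_t *dst)
{
    __m512i v = _mm512_cvtepu16_epi32(_mm256_loadu_si256((const __m256i *)src));
    __mmask16 k = _mm512_cmplt_epu32_mask(v, _mm512_set1_epi32(0x0100)); // category 0x00
    __m512i packed = _mm512_maskz_compress_epi32(k, v); // left-pack matching dwords
    __m256i words  = _mm512_cvtepi32_epi16(packed);     // narrow back to 16-bit
    _mm256_storeu_si256((__m256i *)dst, words);         // over-store; commit by count
    return dst + _mm_popcnt_u32((unsigned)k);
}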

You could also try a vertical approach, where you load N vectors and then swap elements between them so that the 0th vector has the smallest elements, etc. At that point, you can use a more optimized algorithm for the compress stage (e.g., if you vertically sort enough vectors, the vectors at the ends may consist entirely of elements starting with 0x00, etc.).
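The building block for that vertical approach would be a packed min/max comparator, something like this (untested; SSE4.1 for the unsigned 16-bit min/max):

#include <smmintrin.h>  // SSE4.1 for _mm_min_epu16 / _mm_max_epu16

// One comparator of a vertical sorting network: afterwards every lane
// of *a is <= the corresponding lane of *b.
static inline void cmp_swap(__m128i *a, __m128i *b)
{
    __m128i lo = _mm_min_epu16(*a, *b);
    __m128i hi = _mm_max_epu16(*a, *b);
    *a = lo;
    *b = hi;
}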

Finally, you might also consider organizing your data differently, either at the source or as a pre-processing step: separating out the "category" byte, which is always 0-3, from the payload byte. Many of the processing steps only need to happen on one or the other, so you can potentially increase efficiency by splitting them out that way. For example, you could do the comparison operation on 32 bytes that are all categories, and then do the compress operation on the 32 payload bytes (at least in the final step where each category is unique).

This would lead to arrays of byte elements, not 16-bit elements, where the "category" byte is implicit. You've cut your data size in half, which might speed up everything else you want to do with the data in the future.
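Splitting an existing interleaved array that way is itself just a byte-shuffle, e.g. this untested SSSE3 fragment (assuming `src` points at the interleaved input, and relying on x86 being little-endian so the payload is the low byte of each uint16_t):

#include <tmmintrin.h>  // SSSE3

// De-interleave 8 uint16_t values into 8 payload bytes and 8 category
// bytes, each packed into the low half of a vector.
__m128i v = _mm_loadu_si128((const __m128i *)src);
__m128i payloads   = _mm_shuffle_epi8(v, _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                                       -1, -1, -1, -1, -1, -1, -1, -1));
__m128i categories = _mm_shuffle_epi8(v, _mm_setr_epi8(1, 3, 5, 7, 9, 11, 13, 15,
                                                       -1, -1, -1, -1, -1, -1, -1, -1));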

If you can't produce the source data in this format, you could use the bucketing as an opportunity to remove the tag byte as you put the payload into the right bucket, so the output is uint8_t out[4][2048];. If you're doing a SIMD left-pack with a pshufb byte-shuffle as discussed in comments, you could choose a shuffle control vector that packs only the payload bytes into the low half.

(Until AVX512BW, x86 SIMD doesn't have any variable-control word shuffles, only byte or dword, so you already needed a byte shuffle which can just as easily separate payloads from tags at the same time as packing payload bytes to the bottom.)
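For example, if the mask says elements 0 and 2 belong to the current bucket, a whole-element control would read { 0,1, 4,5, ... }, while a payload-only control keeps just the low byte of each selected element (a hypothetical LUT entry; the -1 lanes are zeroed by pshufb):

const __m128i payload_only = _mm_setr_epi8(0, 4, -1, -1, -1, -1, -1, -1,
                                           -1, -1, -1, -1, -1, -1, -1, -1);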

Peter Cordes
BeeOnRope
  • Yeah, filter and left-packing should work well for 2, 3 or 4 buckets, if you can efficiently left-pack. A cache-blocked multi-step approach for 4..16 buckets could be good, but copying the data around too many times will stop being worth it at some point. Cache locality (only 2 to 4 output streams) does help the multi-step left-pack vs. a one-pass histogram of pointers. I thought of the start of this idea while commenting on the question, but I think I rejected it because I was imagining more buckets and didn't think of the multi-pass idea. – Peter Cordes Sep 19 '18 at 01:15
  • Vertical per-element sort with packed min/max instructions can't rule out having an element for a low bucket in the most "max" vector. So I don't think it saves you any work, but it might lead to better cache behaviour if you're usually doing bigger blocks of stores before touching other cache lines. You might `pcmpgtd` / `pmovmskb` / `test+jcc` to skip the left-pack / store for the low buckets on a vector that *usually* doesn't have any elements for the low buckets, though. (Or `ptest`/`jcc` with a mask that checks for high bits of all elements being all-zero.) – Peter Cordes Sep 19 '18 at 01:20
  • Oh, I just looked at the question again, I didn't realize how few buckets they were using. Yeah one-step left-pack should be great. I'd probably go with 128-bit vectors for efficient left-packing because AVX2 doesn't have a lane-crossing 16-bit-element shuffle, so you need `pshufb`. And there's no `vpcompressw`. Hmm, unpacking to 32-bit elements might be a win here, especially with AVX512. – Peter Cordes Sep 19 '18 at 01:23
  • I don't know what the limiting factor is without writing it all out, but it seems likely that 256-bit vectors could still be a win: especially if you can do the `pshufb` all at once with a single LUT entry that spans both lanes. Then you only have to break out the stores separately. About the vertical sort: yes you can't rule anything out, but as you suggest you can probabilistically skip many of the operations, or you can record which elements are out of place and then handle them in a different way (e.g., a scalar fixup step after you've done the "most common" entry type in a SIMD loop). – BeeOnRope Sep 19 '18 at 01:57
  • A 256-bit `vpshufb` to left-pack the 2 lanes separately still needs separate `popcnt` after splitting up the `vpmovmskb` result, and needs one wide shuffle-control vector. I guess you can load / `vinserti128` the two halves of the shuffle mask, after you generate them separately with BMI2 `pdep`, or from a 256 * 16-byte LUT (yuck), or 256 * 8-byte LUT with `vpmovzxbw` or load + `vpunpcklbw` to repeat every byte, and `vpaddb` to get the n, n+1 pair of indices. Yeah maybe 256 * 16 = 4096 bytes for a LUT is a better bet, if this will be used repeatedly. – Peter Cordes Sep 19 '18 at 02:14
  • @PeterCordes - yeah I've tried to think of good ways to shuffle the two lanes so they can be handled efficiently "as a unit" as much as possible: e.g., shuffling them in opposite directions (low half to the right and high half to the left) so they "meet in the middle" of the vector - but then how do you store that? Or doing a lane-crossing shuffle so that you get all your elements contiguous (i.e., fill up one lane entirely if the original vector was more than half full), but that seems also awkward here since there is no variable mask `vpermw`. – BeeOnRope Sep 19 '18 at 02:54
  • Since we don't need to preserve order in this case (the requirement is weaker than a true "filter" or "compress" operation), `vpermd` actually seems to work for lots of configurations, since you can swap "full" DWORDs from one lane into the other to try to fill it and make things contiguous, but some cases don't work (like 5 and 3 elements) and anyway it seems expensive. Maybe the lack of order preservation could also lead to a simplification in the LUT or other approach, but I'm not seeing it. – BeeOnRope Sep 19 '18 at 02:55
  • [AVX512BW `vpermw`](http://felixcloutier.com/x86/VPERMD:VPERMW.html) is variable-mask, but costs multiple uops on SKX. Interesting point about not preserving order; that does allow for unpack to dword, filter/left-pack, and then pack two results with in-lane `vpackusdw` or something. Also, agreed that the OP should store the category bytes separately from the data bytes. Sub-dword operations might as well be byte ops, unless you have AVX512BW. – Peter Cordes Sep 19 '18 at 03:03
  • @PeterCordes - yes, but one problem is that the LUTs grow really big when you pack twice the number of elements in each vector. So using a full 256-bit wide LUT might not be an option. Perhaps generating the shuffle mask by hand using whatever tricks could work. Maybe there is also a hybrid strategy lurking out there with a smaller LUT and some ALU ops... – BeeOnRope Sep 19 '18 at 03:06
  • Right, I was replying to the end of [this comment](https://stackoverflow.com/questions/52389997/can-i-use-simd-to-bucket-sort-categorize/52396540?noredirect=1#comment91739148_52396540) where you said "there's no variable mask `vpermw`". Maybe you meant "in AVX2"? Definitely hard to use, and byte elements do actually make it harder to create masks. So yeah, filtering them into buckets in the first place would be so much better. – Peter Cordes Sep 19 '18 at 03:09
  • @PeterCordes - maybe there was confusion; I wasn't disagreeing or even really referring to `vpermw` (and yes, I meant in AVX2) there. I was just agreeing with you that you might as well use byte ops (in AVX2), and just saying that if you pack the payload bytes densely, your LUTs will blow up. So it was about packing the payload bytes and how it might lead to different techniques, since the tradeoffs are different (you really do need byte-granular shuffles, not just byte-shuffles used to shuffle WORD elements, so ... big LUTs). – BeeOnRope Sep 19 '18 at 03:54
  • You can always work in 8x1-byte chunks with a `movq` load / compare / `pmovmskb`. Then you only need a 256 * 8-byte LUT, and can use it without any extra instruction to expand it. Or `pdep`. So denser packing doesn't *hurt*, it just might not help. – Peter Cordes Sep 19 '18 at 04:11
  • @PeterCordes - yes, for sure. Of course the "doesn't hurt" may not hold if you have to pre-process your data to get it like that, versus being able to generate it like that in the first place. – BeeOnRope Sep 19 '18 at 04:15
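For reference, an untested sketch of the LUT-free shuffle-control generation Peter Cordes outlines above (x86-64 with BMI2; the function name is made up):

#include <immintrin.h>
#include <stdint.h>

// Build a pshufb control that left-packs the 8 uint16_t elements whose
// bits are set in the 8-bit mask m. Lanes past popcount(m) are don't-care.
static inline __m128i pack_ctrl_from_mask(unsigned m)
{
    // Spread each mask bit to its own byte, then to a full 0xFF/0x00 byte.
    uint64_t expanded = _pdep_u64(m, 0x0101010101010101ULL) * 0xFF;
    // Gather the indices (0..7) of the selected elements, packed low.
    uint64_t idx = _pext_u64(0x0706050403020100ULL, expanded);
    // Duplicate each index into a byte pair and turn n,n into 2n,2n+1
    // so pshufb moves whole 16-bit elements.
    __m128i bytes = _mm_cvtsi64_si128((long long)idx);
    __m128i pairs = _mm_unpacklo_epi8(bytes, bytes);            // n,n
    pairs = _mm_add_epi8(pairs, pairs);                         // 2n,2n
    return _mm_add_epi8(pairs, _mm_setr_epi8(0, 1, 0, 1, 0, 1, 0, 1,
                                             0, 1, 0, 1, 0, 1, 0, 1)); // 2n,2n+1
}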