I'm curious about SIMD and wondering if it can handle this use case.
Let's say I have an array of 2048 integers, like [0x018A, 0x004B, 0x01C0, 0x0234, 0x0098, 0x0343, 0x0222, 0x0301, 0x0398, 0x0087, 0x0167, 0x0389, 0x03F2, 0x0034, 0x0345, ...]
Note that the high byte of each one is 0x00, 0x01, 0x02, or 0x03. I want to split them into 4 arrays:
- One for all the integers starting with 0x00
- One for all the integers starting with 0x01
- One for all the integers starting with 0x02
- One for all the integers starting with 0x03
I imagine I would have code like this:
#include <stdint.h>

int main() {
    uint16_t in[2048] = ...;
    // 4 arrays, one for each category
    uint16_t out[4][2048];
    // Pointers to the next available slot in each of the arrays
    uint16_t *nextOut[4] = { out[0], out[1], out[2], out[3] };
    for (uint16_t *nextIn = in; nextIn < in + 2048; nextIn += 4) {
        (*** magic simd instructions here ***)
        // Equivalent non-simd code:
        uint16_t categories[4];
        for (int i = 0; i < 4; i++) {
            // The high byte (0x00..0x03) is the category index
            categories[i] = nextIn[i] >> 8;
        }
        for (int i = 0; i < 4; i++) {
            uint16_t category = categories[i];
            *nextOut[category] = nextIn[i];
            nextOut[category]++;
        }
    }
    // Now I have my categorized arrays!
}
I imagine that my first inner loop doesn't need SIMD; it can be just a single shift-and-mask on a 64-bit word, ((x >> 8) & 0x00FF00FF00FF00FF),
but I wonder if we can turn that second inner loop into SIMD instructions.
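To make the 64-bit word idea concrete, here is a minimal sketch in portable C. The helper names categories4 and split_by_high_byte are hypothetical, and the sketch assumes the element count is a multiple of 4 and that every high byte is in 0x00..0x03, as in my data. Only the category computation is done four lanes at a time; the writes into the output arrays are still a plain scalar loop, which is the part I'm asking about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: computes the category (high byte) of four
   consecutive 16-bit values with one 64-bit shift and one mask. */
static void categories4(const uint16_t *src, uint16_t cats[4]) {
    uint64_t x;
    memcpy(&x, src, sizeof x);             /* load four 16-bit lanes at once */
    x = (x >> 8) & 0x00FF00FF00FF00FFULL;  /* move each lane's high byte to its low byte */
    memcpy(cats, &x, sizeof x);            /* unpack the lanes in their original order */
}

/* Hypothetical helper: splits n values (n must be a multiple of 4) into
   4 buckets. Each out[c] must have room for n values; counts[c] receives
   the number of values written to bucket c. */
static void split_by_high_byte(const uint16_t *in, size_t n,
                               uint16_t *out[4], size_t counts[4]) {
    counts[0] = counts[1] = counts[2] = counts[3] = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint16_t cats[4];
        categories4(in + i, cats);
        for (int j = 0; j < 4; j++) {
            uint16_t c = cats[j];
            assert(c < 4);                 /* assumption: high byte is 0x00..0x03 */
            out[c][counts[c]++] = in[i + j];  /* scalar scatter, one element at a time */
        }
    }
}
```

Masking with 0xFF00 alone would leave values like 0x0100 or 0x0300, which are not usable as indices 0..3, hence the shift before the mask. Copying the word in and out with memcpy keeps the lane order the same regardless of byte order.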
Is there any sort of SIMD instruction for this "categorizing" action that I'm doing?
The "insert" instructions seem somewhat promising, but I'm a bit too green to understand the descriptions at https://software.intel.com/en-us/node/695331.
If not, does anything come close?
Thanks!