
I want to shift SSE/AVX registers left or right by multiples of 32 bits while shifting in zeros.

Let me be more precise about the shifts I'm interested in. For SSE I want to do the following shifts of four 32-bit floats:

shift1_SSE: [1, 2, 3, 4] -> [0, 1, 2, 3]
shift2_SSE: [1, 2, 3, 4] -> [0, 0, 1, 2]

For AVX I want to do the following shifts of eight 32-bit floats:

shift1_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 1, 2, 3, 4, 5, 6, 7]
shift2_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 1, 2, 3, 4, 5, 6]
shift3_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 0, 0, 1, 2, 3, 4]

For SSE I have come up with the following code

shift1_SSE = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)); 
shift2_SSE = _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40);
//shift2_SSE = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8));
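
For clarity, here is a minimal sketch of how I check these; the test harness and array names are only for illustration (it assumes SSE2 for _mm_slli_si128):

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    float in[4] = {1, 2, 3, 4}, out[4];
    __m128 x = _mm_loadu_ps(in);

    // shift1_SSE: shift one zero in at the low element
    __m128 s1 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4));
    _mm_storeu_ps(out, s1);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);   // 0 1 2 3

    // shift2_SSE: shift two zeros in at the low end
    __m128 s2 = _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40);
    _mm_storeu_ps(out, s2);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);   // 0 0 1 2
    return 0;
}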

Is there a better way to do this with SSE?

For AVX I have come up with the following code, which needs AVX2 (and it's untested). Edit: as Paul R explained, this code won't work.

shift1_AVX2 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4));
shift2_AVX2 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 8));
shift3_AVX2 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 12));

How can I do this best with AVX, not AVX2 (for example with _mm256_permute or _mm256_shuffle)? Is there a better way to do this with AVX2?

Edit:

Paul R has informed me that my AVX2 code won't work and that an AVX-only version is probably not worth the effort. Instead, for AVX2 I should use _mm256_permutevar8x32_ps along with _mm256_and_ps. I don't have a system with AVX2 (Haswell), so this is hard to test.

Edit: Based on Felix Wyss's answer I came up with some solutions for AVX which need only three intrinsics for shift1_AVX and shift2_AVX and only one intrinsic for shift3_AVX. This is due to the fact that _mm256_permute2f128_ps has a zeroing feature.

shift1_AVX

__m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 3));       
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);          
__m256 y = _mm256_blend_ps(t0, t1, 0x11);

shift2_AVX

__m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2));
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);
__m256 y = _mm256_blend_ps(t0, t1, 0x33);

shift3_AVX

x = _mm256_permute2f128_ps(x, x, 41);
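
As a quick sanity check of shift3_AVX (the test harness is only for illustration):

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    float in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
    __m256 x = _mm256_loadu_ps(in);
    x = _mm256_permute2f128_ps(x, x, 41);  // control 41: old low lane goes to the high lane, low lane is zeroed
    _mm256_storeu_ps(out, x);
    for (int i = 0; i < 8; i++) printf("%g ", out[i]);  // prints 0 0 0 0 1 2 3 4
    printf("\n");
    return 0;
}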
Z boson

2 Answers


You can do a shift right with _mm256_permute_ps, _mm256_permute2f128_ps, and _mm256_blend_ps as follows:

__m256 t0 = _mm256_permute_ps(x, 0x39);            // [x4  x7  x6  x5  x0  x3  x2  x1]
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 0x81);  // [ 0   0   0   0  x4  x7  x6  x5] 
__m256 y  = _mm256_blend_ps(t0, t1, 0x88);         // [ 0  x7  x6  x5  x4  x3  x2  x1]

The result is in y. In order to do a rotate right, set the permute mask to 0x01 instead of 0x81. Shift/rotate left and larger shifts/rotates can be done similarly by changing the permute and blend control bytes.
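
A sketch of that rotate-right variant, derived from the code above only by changing the _mm256_permute2f128_ps control byte (the comments use the same high-to-low element notation):

__m256 t0 = _mm256_permute_ps(x, 0x39);            // [x4  x7  x6  x5  x0  x3  x2  x1]
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 0x01);  // [x0  x3  x2  x1  x4  x7  x6  x5]
__m256 y  = _mm256_blend_ps(t0, t1, 0x88);         // [x0  x7  x6  x5  x4  x3  x2  x1]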

Felix Wyss
  • That's more instructions than I expected. With SSE it can be done with only one instruction/intrinsic (`_mm_slli_si128`). I thought with AVX2 I could do it with two intrinsics `_mm256_permute2f128_ps` and `_mm256_and_ps`. – Z boson Oct 23 '13 at 06:24
  • I just realized there is an even easier solution using blend. I edited the answer. – Felix Wyss Oct 23 '13 at 15:02
  • That's a much better solution. I misunderstood and thought this was AVX2 code. This is AVX code. I think `shift3_AVX` can be done in two instructions with AVX. – Z boson Oct 24 '13 at 07:36
  • I edited my question using your solution. Your solution shifts them the wrong way but the idea is correct. Thank you! – Z boson Oct 24 '13 at 09:16
  • I figured out a way to do shift3_AVX in one intrinsic. `_mm256_permute2f128_ps` has a [zeroing option](http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B45DFF8A-A71A-4DDB-B77C-BF48A17AFCCE.htm). So shift3_AVX = `_mm256_permute2f128_ps(x, x, 41);` – Z boson Oct 24 '13 at 13:04
  • Actually, using the zeroing I found a way to reduce your solution by one intrinsic. If you set the control word in `_mm256_permute2f128_ps` to 41 it will not only swap high and low but it will zero the high, so you don't have to set the zero later! – Z boson Oct 24 '13 at 13:14
  • Good idea. I amended the answer. Note that the constant must be `0x81`, not `41`! – Felix Wyss Oct 25 '13 at 00:08
  • 41 works for me. Maybe I was not clear but I'm loading an array and want to shift it right while shifting in zeros. Let's say the array has values `array[] = {1,2,3,4,5,6,7,8}`. I load it with `__m256 x = _mm256_load_ps(array)`. Now I want to shift it to {0,1,2,3,4,5,6,7} but your code does {2,3,4,5,6,7,8,0} (at least the previous version did). In any case you have a great answer. – Z boson Oct 25 '13 at 06:53

Your SSE implementation is fine but I suggest you use the _mm_slli_si128 implementation for both of the shifts - the casts make it look complicated but it really boils down to just one instruction for each shift.
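
That is, keeping your own cast pattern, both shifts become a single _mm_slli_si128 each (sketch):

shift1_SSE = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4));  // [1, 2, 3, 4] -> [0, 1, 2, 3]
shift2_SSE = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8));  // [1, 2, 3, 4] -> [0, 0, 1, 2]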

Your AVX2 implementation won't work unfortunately. Almost all AVX instructions are effectively just two SSE instructions in parallel operating on two adjacent 128-bit lanes. So for your first shift1_AVX2 example you'd get:

0, 1, 2, 3, 0, 5, 6, 7
----------  ----------
 LS lane     MS lane

All is not lost however: one of the few instructions which does work across lanes on AVX is _mm256_permutevar8x32_ps. Note that you'll need to use an _mm256_and_ps in conjunction with this to zero the shifted in elements. Note also that this is an AVX2 solution — AVX on its own is very limited for anything other than basic arithmetic/logic operations so I think you'll have a hard time doing this efficiently without AVX2.
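
An untested sketch of what this could look like for shift1_AVX2, assuming the documented operand order (vector first, index vector second); the index and mask constants are my guesses for a one-element shift:

__m256i idx  = _mm256_setr_epi32(0, 0, 1, 2, 3, 4, 5, 6);  // result[i] = x[idx[i] & 7]
__m256  perm = _mm256_permutevar8x32_ps(x, idx);           // {x0, x0, x1, x2, x3, x4, x5, x6} in array order
__m256  mask = _mm256_castsi256_ps(
                   _mm256_setr_epi32(0, -1, -1, -1, -1, -1, -1, -1));
__m256  y    = _mm256_and_ps(perm, mask);                  // {0, x0, x1, x2, x3, x4, x5, x6}, i.e. shift1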

Paul R
  • How do I use _mm_slli_si128 without the intrinsic casts? When I try it, it says something like no suitable conversion for __m128 to __m128i or vice versa. – Z boson Oct 22 '13 at 12:32
  • The casts are just there to keep the compiler happy (MSVC I'm guessing?) - they don't actually generate any code. So your code is fine, I was just saying to use the `_mm_slli_si128` implementation for both shifts rather than the `_mm_shuffle_ps` alternative for the second one. – Paul R Oct 22 '13 at 12:34
  • Oh, I understand what you mean. Yeah, I'm using MSVC2013. I naively assumed that the `_mm_shuffle_ps` would be faster than `_mm_slli_si128`. That's why I switched. Not because of the casts. I did not test which one is faster. It was only a guess. – Z boson Oct 22 '13 at 12:39
  • Well with `_mm_shuffle_ps` you need a zero vector, so it's possible that an additional instruction will be needed to generate this, and it also increases register pressure, hence my recommendation to stick with `_mm_slli_si128` which is a single instruction, single register solution. – Paul R Oct 22 '13 at 12:41
  • Okay, that makes sense mostly. What does "register pressure" mean? – Z boson Oct 22 '13 at 12:43
  • You only have 8 SSE registers in 32 bit mode and 16 registers in 64 bit mode. The more temporary variables the compiler can keep in registers the better performance is likely to be. If your code requires too many registers then the compiler has to "spill" registers to memory. So when you have two alternate solutions and one requires fewer temporary registers then that's the one to go for if there are no other factors to be considered. – Paul R Oct 22 '13 at 12:46
  • Thank you. That answers my questions. I'm looking at some AVX2 code by Agner Fog. It appears that the use of _mm256_permutevar8x32_ps depends on the compiler. MSVC (V11 beta) had/has the operands in the wrong order and so does GCC 4.7.0. ICC has it correct. – Z boson Oct 22 '13 at 12:51
  • Nasty - if this needs to be portable then you'll probably want to wrap the intrinsic up into a macro with some ifdefs. – Paul R Oct 22 '13 at 13:05
  • If you want to see why I asked this question see my answer here http://stackoverflow.com/questions/19494114/parallel-prefix-cumulative-sum-with-sse/19519287#19519287 – Z boson Oct 22 '13 at 13:29
  • Interesting - it seems there are several different ways of doing prefix sum - I wonder if there might be another approach which is more amenable to SIMD, i.e. something where you don't need to do horizontal operations on every iteration? – Paul R Oct 22 '13 at 15:56
  • Yeah, that's why I asked that question. I can use SIMD with vertical operations on the second pass, but I need horizontal operations on the first pass to use SIMD, and they are slower than just using the x87. But the number of shifts and additions for the first pass with SIMD goes as the log of the SIMD width, so for a wide enough SIMD unit SIMD can win out. Maybe it will be useful with AVX-512. – Z boson Oct 23 '13 at 06:39
  • I haven't looked at this in any detail but I have a gut feeling that an approach using `_mm_hadd_ps` might be worth investigating... – Paul R Oct 23 '13 at 08:16
  • I thought about that but can't see how to do it efficiently. For a different question, I'm wondering about switching execution units, integer vs. float. Does `_mm_slli_si128` cause me to switch execution units? See [why do some sse instructions specify moving float values](http://stackoverflow.com/questions/16294198/why-do-some-sse-mov-instructions-specify-that-they-move-floating-point-values). I should probably ask this as a separate question. – Z boson Oct 23 '13 at 08:52
  • For AVX I came up with some solutions which need only 3 intrinsics for shift1_AVX and shift2_AVX and only one intrinsic for shift3_AVX. See my edits. – Z boson Oct 24 '13 at 13:15
  • Good use of `_mm256_permute2f128_ps`. This same technique would be useful for neighbourhood operations in image processing such as filtering etc. – Paul R Oct 24 '13 at 13:37
  • Yeah, now the AVX solution only uses one more intrinsic than the AVX2 solution. But `_mm256_permutevar8x32_ps` has a latency of three and the AVX intrinsics each have a latency of only one, so it's hard to say. – Z boson Oct 24 '13 at 13:42
  • I finally benchmarked the code and the SSE and AVX code are about twice as fast as the sequential code! I did not expect that. My overall boost is now about 7x on my 4 core Ivy Bridge system. I posted the code in my answer at [simd-prefix-sum-on-inte](http://stackoverflow.com/questions/10587598/simd-prefix-sum-on-intel-cpu/19496697#19496697) – Z boson Oct 25 '13 at 08:52
  • Cool - it would be interesting to try an AVX2 implementation on a Haswell CPU too. – Paul R Oct 25 '13 at 10:18
  • Yeah, that's going to have to wait until I find a Haswell system. I put a test of the code at [coliru](http://coliru.stacked-crooked.com/a/3b759ae947571cfd). I don't know what system it is. It does not have AVX as I had to remove the AVX code for it to run. The number of threads is 4 but I think it only has two cores because the results with OpenMP are not so impressive. In any case the gain is over 3x on that system and over 7x on my system. Don't worry about the errors: they are due to the floating point precision. I'm summing the counting numbers and comparing to the exact formula. – Z boson Oct 25 '13 at 10:52