6

SHLD/SHRD are x86 assembly instructions used to implement multiprecision shifts.

Consider the following problem:

uint64_t array[4] = {/*something*/};
left_shift(array, 172);
right_shift(array, 172);

What is the most efficient way to implement left_shift and right_shift, two functions that shift an array of four 64-bit unsigned integers as if it were a single 256-bit unsigned integer?

Is the most efficient way to do that with the SHLD/SHRD instructions, or are there better instructions (such as SIMD versions) on modern architectures?
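For concreteness, here is a plain, unoptimized C version of the semantics I have in mind (assuming array[0] holds the least-significant qword):

#include <stdint.h>

/* Reference implementation: treat a[0..3] as one 256-bit integer,
   a[0] being the least-significant qword. Works in place. */
static void left_shift(uint64_t a[4], unsigned count)
{
    int q = count / 64, r = count % 64;   /* whole-qword part and bit part */
    for (int i = 3; i >= 0; i--) {
        uint64_t v = (i - q >= 0) ? a[i - q] << r : 0;
        if (r != 0 && i - q - 1 >= 0)
            v |= a[i - q - 1] >> (64 - r);
        a[i] = v;
    }
}

static void right_shift(uint64_t a[4], unsigned count)
{
    int q = count / 64, r = count % 64;
    for (int i = 0; i < 4; i++) {
        uint64_t v = (i + q <= 3) ? a[i + q] >> r : 0;
        if (r != 0 && i + q + 1 <= 3)
            v |= a[i + q + 1] << (64 - r);
        a[i] = v;
    }
}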

rustyx
Vincent
  • For which architecture are you programming? If you're on x86 you may have instructions up to SSE3 [edit: as @Ruslan pointed out you may have AVX/AVX2 support in 32 bit mode], or on x86_64 up to AVX2 (unless you're very lucky and get to program for AVX512 on a big Intel coprocessor). If you're on ARM and have NEON support there are SIMD shift instructions as well. – Dalton Sep 01 '16 at 16:42
  • Depends if that "172" is fixed, or just example value: as 172 is 21.5 bytes, allowing you to memmove the content by 21 bytes first, then shifting the 11 target bytes 4 times to right (ie. 3x `shrd`) and clearing the other 21 bytes with zero. If you have the value already in registers, check this question for many resources: http://stackoverflow.com/q/25248766/4271923 – Ped7g Sep 01 '16 at 16:45
  • @Dalton you can use AVX2 in 32-bit mode too (limited to 8 `ymmN` registers though, as with `xmmN`). – Ruslan Sep 01 '16 at 16:46
  • @Ruslan Thanks, I made an edit to the comment. You're right about the YMM register aliasing. Do you know if AVX512 variants have ZMM aliasing to the XMM registers as well? If I recall correctly they do alias to YMM, at least. – Dalton Sep 01 '16 at 16:48
  • @Dalton yes, they are all extensions of the previous generations. This includes the added `ZMM16-ZMM31`, which are still accessible in the lower parts by corresponding `YMM` and `XMM` registers. – Ruslan Sep 01 '16 at 16:52
  • OP, consider looking into these two references: (1) [ARM intrinsics reference](http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf) (PDF); (2) [Intel intrinsics reference](https://software.intel.com/sites/landingpage/IntrinsicsGuide/). – Dalton Sep 01 '16 at 16:52
  • @Ruslan Right, okay thank you. That is what I suspected but I was not entirely sure. In 32 bit mode the AVX/AVX2 specific instructions are not available though, correct? That is why I said "on x86 you may have *instructions* up to SSE3". – Dalton Sep 01 '16 at 16:54
  • @Dalton no, all of them are available too (e.g. `vcvtpd2ps`, which is an explicitly VEX-encoded version of `cvtpd2ps`, or the new `vextractf128`). It's just that some opcodes which would refer to higher registers like `YMM8` mean something different, due to VEX being mapped onto other instructions (which are removed in long mode), or are undefined (result in #UD). – Ruslan Sep 01 '16 at 17:00
  • @Ruslan Ahh, okay, that makes sense. Thank you for the clarification on those points! – Dalton Sep 01 '16 at 17:01
  • If `172` is a compile-time constant, you should take advantage of that instead of using variable-count instructions. Fixed shuffles / shifts can be more efficient / flexible. 172 isn't a multiple of 8, though, so you do still need to move bits between elements. – Peter Cordes May 05 '18 at 14:19
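Picking up the last two comments about a compile-time-constant count: 172 splits into 172/64 = 2 whole qwords plus 172%64 = 44 bits, so a constant-count shift reduces to a couple of qword moves and one small bit shift each. A rough C sketch of that idea (the function name and the little-endian qword order are my assumptions; a compiler can fully unroll and constant-fold this):

#include <stdint.h>
#include <string.h>

static void left_shift_172(uint64_t a[4])
{
    enum { Q = 172 / 64, R = 172 % 64 };      /* Q = 2 whole qwords, R = 44 bits */
    uint64_t r[4] = {0, 0, 0, 0};             /* the two low qwords stay zero    */
    for (int i = 3; i >= Q; i--) {
        r[i] = a[i - Q] << R;                 /* move qword up and shift it      */
        if (i - Q - 1 >= 0)
            r[i] |= a[i - Q - 1] >> (64 - R); /* bits carried from the qword below */
    }
    memcpy(a, r, sizeof r);
}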

1 Answer

5

In this answer I'm only going to talk about x64.
32-bit x86 has been outdated for 15 years now; if you're coding in 2016 it hardly makes sense to be stuck in 2000.
All times are according to Agner Fog's instruction tables.

Intel Skylake example timings*
The shld/shrd instructions are rather slow on x64.
Even on Intel Skylake they have a latency of 4 cycles and use 4 uops, meaning they tie up a lot of execution units; on older processors they're even slower.
I'm going to assume you want to shift by a variable amount, which means a variable-count SHLD:

SHLD RAX,RDX,cl        4 uops, 4 cycle latency  ->  1/16 cycle per bit
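For reference, this is what SHLD RAX,RDX,cl computes, written as C (the helper name is mine; valid for counts 1..63):

#include <stdint.h>

/* SHLD RAX,RDX,cl: shift RAX left by cl and fill the vacated low bits
   with the top cl bits of RDX. */
static uint64_t shld64(uint64_t rax, uint64_t rdx, unsigned cl)
{
    return (rax << cl) | (rdx >> (64 - cl));
}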

Using a shift, a rotate, a mask and an OR you can do this faster:

@Init:
MOV R15,-1
NEG CL        //temporarily turn cl into 64-cl (mod 64)
SHR R15,cl    //R15 = mask of the low (original cl) bits: keeps only the bits rotated in from RDX's top
NEG CL        //restore cl
@Work:
SHL RAX,cl        3 uops, 2 cycle latency
ROL RDX,cl        3 uops, 2 cycle latency
AND RDX,R15       1 uop, 0.25 cycle reciprocal throughput
OR RAX,RDX        1 uop, 0.25 cycle reciprocal throughput
//Still needs unrolling to achieve the least amount of slowness.

Note that, just like SHLD itself, this still only produces 64 bits of shifted result per round.
So you're trying to beat 4 cycles per 64 bits.
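Written as C (helper name is mine), the SHL/ROL/AND/OR sequence above computes the same value as shld64 for counts 1..63, which is an easy way to sanity-check it:

#include <stdint.h>

/* C equivalent of the SHL/ROL/AND/OR emulation above, for one 64-bit chunk. */
static uint64_t shld_emulated(uint64_t rax, uint64_t rdx, unsigned cl)
{
    uint64_t mask = ~0ULL >> (64 - cl);               /* low cl bits (the NEG/SHR init) */
    uint64_t rot  = (rdx << cl) | (rdx >> (64 - cl)); /* ROL RDX,cl                     */
    return (rax << cl) | (rot & mask);                /* SHL + AND + OR                 */
}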

//4*64 bits parallel shift.  
//Shifts in zeros.
VPSLLVQ YMM2, YMM2, YMM3    1 uop, 0.5 cycle reciprocal throughput.

However, if you want it to do exactly what SHLD does, you'll need an extra VPSRLVQ and a VPOR to combine the two results.

VPSLLVQ YMM1, YMM2, YMM3    1 uop, 0.5 cycle reciprocal throughput.
VPSRLVQ YMM5, YMM2, YMM4    1 uop, 0.5 cycle reciprocal throughput.
VPOR    YMM1, YMM1, YMM5    1 uop, 0.33 cycle reciprocal throughput.

You'll need to interleave 4 sets of these, costing you (3*4)+2 = 14 YMM registers.
Doing so, I doubt you'll profit from the low 0.33-cycle throughput of VPOR, so I'll assume 0.5 instead.
That makes 3 uops and 1.5 cycles per 256 bits = 1/171 cycle per bit = 0.37 cycles per qword = roughly 10x faster, not bad.
If you are able to get 1.33 cycles per 256 bits, that's 1/192 cycle per bit = 0.33 cycles per qword = 12x faster.
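Note that VPSLLVQ/VPSRLVQ only shift within each qword; a full 256-bit shift also has to move bits between qwords. A rough AVX2 intrinsics sketch of one way to wire the three instructions above together and add that cross-qword step, for shift counts 0..63 (counts >= 64 would be handled separately, e.g. by permuting whole qwords first; the function name and the assumption that array[0] is the least-significant qword are mine):

#include <immintrin.h>
#include <stdint.h>

static void left_shift_small_avx2(uint64_t array[4], unsigned n)   /* 0 <= n <= 63 */
{
    __m256i v    = _mm256_loadu_si256((const __m256i *)array);
    __m256i cnt  = _mm256_set1_epi64x((long long)n);        /* the YMM3 of the example above */
    __m256i rcnt = _mm256_set1_epi64x((long long)(64 - n)); /* the YMM4 of the example above */

    __m256i hi = _mm256_sllv_epi64(v, cnt);                 /* a[i] << n        (VPSLLVQ)    */

    /* Bring each qword's lower neighbour into position i; element 0 has no
       lower neighbour, so blend in zero there. */
    __m256i prev = _mm256_permute4x64_epi64(v, 0x90);       /* {a0, a0, a1, a2}              */
    prev = _mm256_blend_epi32(prev, _mm256_setzero_si256(), 0x03);

    __m256i lo = _mm256_srlv_epi64(prev, rcnt);             /* a[i-1] >> (64-n) (VPSRLVQ)    */

    _mm256_storeu_si256((__m256i *)array, _mm256_or_si256(hi, lo)); /* VPOR */
}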

'It’s the Memory, Stupid!'
Obviously I've not added in loop overhead and load/stores to/from memory.
The loop overhead is tiny given proper alignment of jump targets, but the memory
access will easily be the biggest slowdown.
A single cache miss to main memory on Skylake can cost you more than 250 cycles (see footnote 1).
It is in clever management of memory that the major gains will be made.
The 12 times possible speed-up using AVX256 is small potatoes in comparison.

I'm not counting the setup of the shift counter in CL/(YMM3/YMM4) because I'm assuming you'll reuse that value over many iterations.

You're not going to beat that with AVX512 instructions, because consumer-grade CPUs with AVX512 are not yet available.
The only processor that currently supports it is Knights Landing.

*) All these timings are best case values, and should be taken as indications, not as hard values.
1) Cost of a cache miss on Skylake: 42 cycles + 52 ns = 42 + (52 ns * 4.6 GHz) ≈ 281 cycles.

Johan
  • Just to nit, cache misses to memory on Skylake aren't as bad as 1000 cycles (unless you count page faults). That can only happen if it was a cache miss to a very remote NUMA node. But that isn't really possible atm since multi-socket Skylake servers haven't been released yet. – Mysticial Sep 01 '16 at 18:38
  • Huh, it's really weird that on SKL, VPSLLVQ is more efficient than the normal VPSLLQ (which takes the shift count from only the bottom element). It looks like SKL's VPSLLQ uses a port5 shuffle to broadcast the shift-count to every element of a vector, then feeds that to the VPSLLVQ execution units. On BDW and earlier, VPSLLQ also takes a port5 uop, but VPSLLVQ is even slower. Anyway, for immediate-count shifts (which is probably common after inlining), `VPSLLQ v, v, i` is definitely the most efficient way. – Peter Cordes Sep 02 '16 at 00:45
  • BTW, you should use VPOR, not VPADDQ, for better throughput on pre-SKL. Also, I think you're missing any instructions to move data between elements. A large shift count can move data from the first qword to the last qword. An unaligned load might be good if the data isn't in a register to start with; then you only need to handle shift counts up to 7 or 63. (And you can use an immediate-count byte-shift or something, instead of looking up a shuffle mask from a table.) – Peter Cordes Sep 03 '16 at 07:16