128-bit shifts using assembly language?

Question

What is the most efficient way to do a 128-bit shift on a modern Intel CPU (Core i7, Sandy Bridge)?

Code similar to this is in my innermost loop:

u128 a[N];
void xor() {
  for (int i = 0; i < N; ++i) {
    a[i] = a[i] ^ (a[i] >> 1) ^ (a[i] >> 2);
  }
}

The data in a[N] is almost random.

You could start by turning on maximum optimization and seeing what the compiler generates. — Raymond Chen, Oct 24 '11 at 02:27
Can you show us the definition of `u128`? I can probably provide an efficient solution using SSE. — Mysticial, Oct 24 '11 at 02:48
Shift intrinsics are listed here: http://msdn.microsoft.com/en-us/library/edy397f8.aspx — Hans Passant, Oct 24 '11 at 13:14
`_mm_slli_si128`: Shifts the 128-bit value in a left [sic] by imm bytes while shifting in zeros. Etcetera. — Hans Passant, Oct 25 '11 at 14:40
Usually a "shift" (like the C operator ">>" in the question) means a bit shift, and that is what was asked for. Additionally, Intel's byte shift is restricted to a constant amount. If you want to byte-shift by a variable amount, you are better off avoiding this instruction and doing a misaligned store/load instead, since the misalignment penalty on i7 and newer is negligible. — Gunther Piez, Oct 25 '11 at 20:14
Windows or Linux? Or more correctly, MSVC or GCC (and friends), or maybe MASM versus NASM? — jww, Jul 19 '15 at 02:52

GJ. · Accepted Answer · 2011-10-24T10:34:52.383

Use the double-precision shift instructions.

That means SHLD or SHRD, because SSE isn't intended for this purpose. This is the classic method; below are test cases for a 128-bit left shift by 16 bits in 32-bit and 64-bit CPU modes.

This way you can shift a value of unlimited size by up to 32/64 bits. You can shift by an immediate number of bits or by the count in the cl register. The first instruction operand can also address a variable in memory.

128 bit left shift by 16 bits under 32 bit x86 CPU mode:

    mov     eax, $04030201;
    mov     ebx, $08070605;
    mov     ecx, $0C0B0A09;
    mov     edx, $100F0E0D;

    shld    edx, ecx, 16
    shld    ecx, ebx, 16
    shld    ebx, eax, 16
    shl     eax, 16

And 128 bit left shift by 16 bits under 64 bit x86 CPU mode:

    mov    rax, $0807060504030201;
    mov    rdx, $100F0E0D0C0B0A09;

    shld   rdx, rax, 16
    shl    rax, 16
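
For readers who prefer C, the 64-bit SHLD/SHL pair above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the `u128_pair` struct and the `shl128` helper are hypothetical names, and the shift count is assumed to be at most 63, the same limit the instruction pair has.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value stored as two 64-bit halves (not from the answer). */
typedef struct {
    uint64_t lo; /* bits 0-63   */
    uint64_t hi; /* bits 64-127 */
} u128_pair;

/* Left shift by n bits, 0 <= n <= 63.
 * Mirrors "shld hi, lo, n" followed by "shl lo, n". */
static u128_pair shl128(u128_pair x, unsigned n)
{
    assert(n < 64);
    if (n != 0) { /* for n == 0, (64 - n) would be an undefined shift by 64 in C */
        x.hi = (x.hi << n) | (x.lo >> (64 - n)); /* bits leaving lo enter hi */
        x.lo <<= n;
    }
    return x;
}
```

With `lo = 0x0807060504030201` and `hi = 0x100F0E0D0C0B0A09` (the byte pattern of the 32-bit test case) and `n = 16`, the top two bytes of `lo` move into the bottom two bytes of `hi`. Whether a compiler turns this into SHLD depends on the compiler and its options.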

I have used this. It works and is reasonably fast, but you should mention that the 32 bit code allows shift up to 31 and the 64 bit code up to 63. If you want to shift by a variable amount, which can't be guaranteed to be less than 64, this can't be used. — Gunther Piez, Oct 25 '11 at 14:31
@drhirsch: I did mention up to 32/64 bits, and of course it should be up to 31/63 bits; if you want more than that, you move whole 32/64-bit words. — GJ., Oct 25 '11 at 15:31

Marat Dukhan · Answer 2 · 2011-10-26T06:53:42.827

In this particular case you could use a combination of x86 SHR and RCR instructions:

; a0 - bits 0-31 of a[i]
; a1 - bits 32-63 of a[i]
; a2 - bits 64-95 of a[i]
; a3 - bits 96-127 of a[i]
mov eax, a0
mov ebx, a1
mov ecx, a2
mov edx, a3

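; a right shift must start at the most significant word so that
; the carry flag passes each dropped bit down to the next word
shr edx, 1
rcr ecx, 1
rcr ebx, 1
rcr eax, 1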

; b0 - bits 0-31 of b[i] := a[i] >> 1
; b1 - bits 32-63 of b[i] := a[i] >> 1
; b2 - bits 64-95 of b[i] := a[i] >> 1
; b3 - bits 96-127 of b[i] := a[i] >> 1
mov b0, eax
mov b1, ebx
mov b2, ecx
mov b3, edx

; same order again: most significant word first
shr edx, 1
rcr ecx, 1
rcr ebx, 1
rcr eax, 1

; c0 - bits 0-31 of c[i] := a[i] >> 2 = b[i] >> 1
; c1 - bits 32-63 of c[i] := a[i] >> 2 = b[i] >> 1
; c2 - bits 64-95 of c[i] := a[i] >> 2 = b[i] >> 1
; c3 - bits 96-127 of c[i] := a[i] >> 2 = b[i] >> 1
mov c0, eax
mov c1, ebx
mov c2, ecx
mov c3, edx

If your target is x86-64 this simplifies to:

; a0 - bits 0-63 of a[i]
; a1 - bits 64-127 of a[i]
mov rax, a0
mov rbx, a1

; shift the high half first; its bit 0 reaches the low half through the carry flag
shr rbx, 1
rcr rax, 1

; b0 - bits 0-63 of b[i] := a[i] >> 1
; b1 - bits 64-127 of b[i] := a[i] >> 1
mov b0, rax
mov b1, rbx

; high half first again
shr rbx, 1
rcr rax, 1

; c0 - bits 0-63 of c[i] := a[i] >> 2 = b[i] >> 1
; c1 - bits 64-127 of c[i] := a[i] >> 2 = b[i] >> 1
mov c0, rax
mov c1, rbx

Update: corrected typos in 64-bit version
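
As a cross-check for the assembly, the whole loop body from the question can be sketched in portable C with two 64-bit halves. This is a minimal sketch under assumptions: `u128_halves`, `shr128` and `xor_shifted` are hypothetical names, since the question does not show how `u128` is defined, and shift counts are limited to 1..63.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the question's u128: two 64-bit halves. */
typedef struct {
    uint64_t lo; /* bits 0-63   */
    uint64_t hi; /* bits 64-127 */
} u128_halves;

/* Logical right shift by n bits, 1 <= n <= 63. */
static u128_halves shr128(u128_halves x, unsigned n)
{
    assert(n >= 1 && n < 64);
    x.lo = (x.lo >> n) | (x.hi << (64 - n)); /* bits leaving hi enter lo */
    x.hi >>= n;
    return x;
}

/* a[i] = a[i] ^ (a[i] >> 1) ^ (a[i] >> 2) for every element. */
static void xor_shifted(u128_halves *a, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        u128_halves s1 = shr128(a[i], 1);
        u128_halves s2 = shr128(a[i], 2);
        a[i].lo ^= s1.lo ^ s2.lo;
        a[i].hi ^= s1.hi ^ s2.hi;
    }
}
```

Calling `xor_shifted(a, N)` on an array of such structs performs the same update as the loop in the question. It is useful mainly as a reference to compare hand-written assembly against, and as a baseline to see what the compiler generates on its own.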

Unfortunately, the RCR/RCL instructions are exceptionally slow on almost all modern processors. SHLD/SHRD is a better alternative. — Gunther Piez, Oct 25 '11 at 14:36
And in the second case, instead of **shr eax, 1; rcr ebx, 1** it must be **shr rax, 1; rcr rbx, 1** — GJ., Oct 25 '11 at 16:01
RCR/RCL is fast when the second argument is 1. This is exactly the case for this problem. When the second argument is 1 RCR/RCL is faster than SHLD/SHRD on all modern CPUs: — Marat Dukhan, Oct 26 '11 at 06:56
When the second argument is 1 RCR/RCL is faster than SHLD/SHRD on all modern CPUs except Sandy Bridge and Atom. — Marat Dukhan, Oct 26 '11 at 07:08
