
I've recently been introduced to Vector Instructions (theoretically) and am excited about how I can use them to speed up my applications.

One area I'd like to improve is a very hot loop:

__declspec(noinline) int pleaseVectorize(int* arr, int* someGlobalArray, int* output)
{
    for (int i = 0; i < 16; ++i)
    {
        auto someIndex = arr[i];
        output[i] = someGlobalArray[someIndex];
    }

    for (int i = 0; i < 16; ++i)
    {
        if (output[i] == 1)
        {
            return i;
        }
    }

    return -1;
}

But of course, all three major compilers (MSVC, GCC, Clang) refuse to vectorize this. I can sort of understand why, but I wanted to get confirmation.

If I had to vectorize this by hand, it would be:

(1) Vector-load "arr": this brings in 16 4-byte integers, let's say into zmm0

(2) 16 memory loads: load from the address held in zmm0[0..3] into zmm1[0..3], from the address held in zmm0[4..7] into zmm1[4..7], and so on

(3) compare zmm0 and zmm1

(4) vector popcnt into the output to find out the most significant bit and basically divide that by 8 to get the index that matched

First of all, can vector instructions do these things? Like, can they do this "gathering" operation, i.e. do loads from the addresses held in zmm0?

Here is what clang generates:

0000000000400530 <_Z5superPiS_S_>:
  400530:       48 63 07                movslq (%rdi),%rax
  400533:       8b 04 86                mov    (%rsi,%rax,4),%eax
  400536:       89 02                   mov    %eax,(%rdx)
  400538:       48 63 47 04             movslq 0x4(%rdi),%rax
  40053c:       8b 04 86                mov    (%rsi,%rax,4),%eax
  40053f:       89 42 04                mov    %eax,0x4(%rdx)
  400542:       48 63 47 08             movslq 0x8(%rdi),%rax
  400546:       8b 04 86                mov    (%rsi,%rax,4),%eax
  400549:       89 42 08                mov    %eax,0x8(%rdx)
  40054c:       48 63 47 0c             movslq 0xc(%rdi),%rax
  400550:       8b 04 86                mov    (%rsi,%rax,4),%eax
  400553:       89 42 0c                mov    %eax,0xc(%rdx)
  400556:       48 63 47 10             movslq 0x10(%rdi),%rax
  40055a:       8b 04 86                mov    (%rsi,%rax,4),%eax
  40055d:       89 42 10                mov    %eax,0x10(%rdx)
  400560:       48 63 47 14             movslq 0x14(%rdi),%rax
  400564:       8b 04 86                mov    (%rsi,%rax,4),%eax
  400567:       89 42 14                mov    %eax,0x14(%rdx)
  40056a:       48 63 47 18             movslq 0x18(%rdi),%rax
  40056e:       8b 04 86                mov    (%rsi,%rax,4),%eax
  400571:       89 42 18                mov    %eax,0x18(%rdx)
  400574:       48 63 47 1c             movslq 0x1c(%rdi),%rax
  400578:       8b 04 86                mov    (%rsi,%rax,4),%eax
  40057b:       89 42 1c                mov    %eax,0x1c(%rdx)
  40057e:       48 63 47 20             movslq 0x20(%rdi),%rax
  400582:       8b 04 86                mov    (%rsi,%rax,4),%eax
  400585:       89 42 20                mov    %eax,0x20(%rdx)
  400588:       48 63 47 24             movslq 0x24(%rdi),%rax
  40058c:       8b 04 86                mov    (%rsi,%rax,4),%eax
  40058f:       89 42 24                mov    %eax,0x24(%rdx)
  400592:       48 63 47 28             movslq 0x28(%rdi),%rax
  400596:       8b 04 86                mov    (%rsi,%rax,4),%eax
  400599:       89 42 28                mov    %eax,0x28(%rdx)
  40059c:       48 63 47 2c             movslq 0x2c(%rdi),%rax
  4005a0:       8b 04 86                mov    (%rsi,%rax,4),%eax
  4005a3:       89 42 2c                mov    %eax,0x2c(%rdx)
  4005a6:       48 63 47 30             movslq 0x30(%rdi),%rax
  4005aa:       8b 04 86                mov    (%rsi,%rax,4),%eax
  4005ad:       89 42 30                mov    %eax,0x30(%rdx)
  4005b0:       48 63 47 34             movslq 0x34(%rdi),%rax
  4005b4:       8b 04 86                mov    (%rsi,%rax,4),%eax
  4005b7:       89 42 34                mov    %eax,0x34(%rdx)
  4005ba:       48 63 47 38             movslq 0x38(%rdi),%rax
  4005be:       8b 04 86                mov    (%rsi,%rax,4),%eax
  4005c1:       89 42 38                mov    %eax,0x38(%rdx)
  4005c4:       48 63 47 3c             movslq 0x3c(%rdi),%rax
  4005c8:       8b 04 86                mov    (%rsi,%rax,4),%eax
  4005cb:       89 42 3c                mov    %eax,0x3c(%rdx)
  4005ce:       c3                      retq
  4005cf:       90                      nop
halivingston

1 Answer

Your idea of how it could work is close, except that you want a bit-scan / find-first-set-bit (x86 BSF or TZCNT) of the compare bitmap, not population-count (number of bits set).
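
For illustration, here's a tiny sketch of the difference (the bitmap value is made up; compile with popcnt/BMI1 support, e.g. -march=haswell):

#include <immintrin.h>
#include <cstdio>

int main()
{
    unsigned bitmap = 0b101000;              // pretend elements 3 and 5 matched
    printf("%d\n", _mm_popcnt_u32(bitmap));  // 2: how many elements matched (not what you want)
    printf("%u\n", _tzcnt_u32(bitmap));      // 3: index of the first (lowest) match
    printf("%u\n", _tzcnt_u32(0));           // 32: tzcnt of an all-zero mask gives the operand width
    return 0;
}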

AVX2 / AVX512 have vpgatherdd, which gathers using a vector of signed 32-bit scaled indices. It's barely worth using on Haswell, improved on Broadwell, and very good on Skylake. (http://agner.org/optimize/, and see other links in the x86 tag wiki, such as Intel's optimization manual, which has a section on gather performance.) The SIMD compare and bit-scan are very cheap by comparison: a single uop each, fully pipelined.


gcc8.1 can auto-vectorize your gather, if it can prove that your inputs don't overlap your output function arg. That's sometimes possible after inlining, but for the non-inline version you can promise this with int * __restrict output. Or make output a local temporary instead of a function arg. (General rule: storing through a non-__restrict pointer will often inhibit auto-vectorization, especially if it's a char* that can alias anything.)
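
A minimal sketch of that source-level change (the scalar logic is taken straight from the question; the only adjustments are the __restrict qualifier and an int return type so the second loop's return compiles):

int pleaseVectorize(const int *arr, const int *someGlobalArray, int *__restrict output)
{
    // Gather loop: the compiler can now assume output[] doesn't alias arr[] or someGlobalArray[]
    for (int i = 0; i < 16; ++i)
        output[i] = someGlobalArray[arr[i]];

    // Search loop: gcc/clang won't vectorize this part (trip count not known up front)
    for (int i = 0; i < 16; ++i)
        if (output[i] == 1)
            return i;

    return -1;
}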

gcc and clang never vectorize search loops, only loops whose trip count can be calculated before entering the loop. But ICC can; it does a scalar gather and stores the result (even if output[] is a local, so it doesn't have to do that as a side effect of running the function), then uses a SIMD packed-compare + bit-scan.

Compiler output for a __restrict version. Notice that gcc8.1 and ICC avoid 512-bit vectors by default when tuning for Skylake-AVX512. 512-bit vectors can limit the max-turbo, and always shut down the vector ALU on port 1 while they're in the pipeline, so it can make sense to use AVX512 or AVX2 with 256-bit vectors in case this function is only a small part of a big program. (Compilers don't know that this function is super-hot in your program.)
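
As an illustration of the 256-bit option, here's a rough sketch of what a plain-AVX2 version could look like (the function name and the two-half structure are mine, not any compiler's output; an AVX512VL version could use vpcmpd with mask registers instead):

#include <immintrin.h>

// Hypothetical AVX2 (256-bit) sketch: gather 16 ints as two 8-element halves,
// compare against 1, and bit-scan the combined 16-bit match bitmap.
static inline int find_first_1_avx2(const int *arr, const int *someGlobalArray,
                                    int *__restrict output)
{
    __m256i idx_lo = _mm256_loadu_si256((const __m256i *)arr);
    __m256i idx_hi = _mm256_loadu_si256((const __m256i *)(arr + 8));

    __m256i g_lo = _mm256_i32gather_epi32(someGlobalArray, idx_lo, 4);
    __m256i g_hi = _mm256_i32gather_epi32(someGlobalArray, idx_hi, 4);

    _mm256_storeu_si256((__m256i *)output, g_lo);
    _mm256_storeu_si256((__m256i *)(output + 8), g_hi);

    __m256i ones = _mm256_set1_epi32(1);
    unsigned m_lo = _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(g_lo, ones)));
    unsigned m_hi = _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(g_hi, ones)));

    unsigned mask = m_lo | (m_hi << 8);   // one bit per element, bit 0 = element 0
    int pos = _tzcnt_u32(mask);           // 32 if no bit is set
    return pos >= 16 ? -1 : pos;
}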

If output[] is a local, a better code-gen strategy would probably be to compare while gathering, so an early hit skips the rest of the loads. The compilers that go fully scalar (clang and MSVC) both miss this optimization. In fact, they even store to the local array even though clang mostly doesn't re-read it (keeping results in registers). Writing the source with the compare inside the first loop would work to get better scalar code. (Depending on cache misses from the gather vs. branch mispredicts from non-SIMD searching, scalar could be a good strategy. Especially if hits in the first few elements are common. Current gather hardware can't take advantage of multiple elements coming from the same cache line, so the hard limit is still 2 elements loaded per clock cycle. But using a wide vector load for the indices to feed a gather reduces load-port / cache access pressure significantly if your data was mostly hot in cache.)
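
A scalar sketch of that "compare inside the first loop" idea, for the case where output[] isn't actually needed (this is a source-level rewrite using the question's names, not compiler output):

static inline int find_first_1_scalar(const int *arr, const int *someGlobalArray)
{
    for (int i = 0; i < 16; ++i)
    {
        // Compare while loading: an early hit skips the remaining loads
        if (someGlobalArray[arr[i]] == 1)
            return i;
    }
    return -1;
}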

A compiler could have auto-vectorized the __restrict version of your code to something like this. (gcc manages the gather part, ICC manages the SIMD compare part)

;; Windows x64 calling convention: rcx,rdx, r8,r9
; but of course you'd actually inline this
; only uses ZMM16..31, so vzeroupper not required

vmovdqu32   zmm16, [rcx/arr]   ; You def. want to reach an alignment boundary if you can for ZMM loads, vmovdqa32 will enforce that

kxnorw      k1, k0,k0      ; k1 = -1.  k0 false dep is likely not a problem.
  ; optional: vpxord  xmm17, xmm17, xmm17   ; break merge-masking false dep
vpgatherdd  zmm17{k1}, [rdx + zmm16 * 4]    ; GlobalArray + scaled-vector-index
; sets k1 = 0 when done

vmovdqu32   [r8/output], zmm17

vpcmpd      k1, zmm17, zmm31, 0    ; 0->EQ.  Outside the loop, do zmm31=set1_epi32(1)
                                   ; k1 = compare bitmap
kortestw    k1, k1
jz         .not_found      ; early check for not-found

kmovw       edx, k1

           ; tzcnt doesn't have a false dep on the output on Skylake
           ; so no AVX512 CPUs need to worry about that HSW/BDW issue
tzcnt       eax, edx       ; bit-scan for the first (lowest-address) set element
                           ; input=0 produces output=32
      ; or avoid the branch and let 32 be the not-found return value.
      ; or do a branchless kortestw / cmov if -1 is directly useful without branching
ret

.not_found:
   mov eax, -1
   ret

You can do this yourself with intrinsics:

Intel's instruction-set reference manual (HTML extract at http://felixcloutier.com/x86/index.html) includes C/C++ intrinsic names for each instruction, or search for them in https://software.intel.com/sites/landingpage/IntrinsicsGuide/

I changed the output type to __m512i. You could change it back to an array if you aren't manually vectorizing the caller. You definitely want this function to inline.

#include <immintrin.h>

//__declspec(noinline)  // I *hope* this was just to see the stand-alone asm version
                        // but it means the output array can't optimize away at all

//static inline
int find_first_1(const int *__restrict arr, const int *__restrict someGlobalArray, __m512i *__restrict output)
{
    __m512i vindex = _mm512_load_si512(arr);
    __m512i gather = _mm512_i32gather_epi32(vindex, someGlobalArray, 4);  // indexing by 4-byte int
    *output = gather;  

    __mmask16 cmp = _mm512_cmpeq_epi32_mask(gather, _mm512_set1_epi32(1));
       // Intrinsics make masks freely convert to integer
       // even though it costs a `kmov` instruction either way.
    int onepos =  _tzcnt_u32(cmp);
    if (onepos >= 16){
        return -1;
    }
    return onepos;
}
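
For example, a rough sketch of a caller (array names and contents are made up; arr needs 64-byte alignment because _mm512_load_si512 is an aligned load):

#include <immintrin.h>
#include <cstdio>

// find_first_1() as defined above

int main()
{
    alignas(64) int arr[16] = {3, 7, 1, 0, 5, 2, 9, 4, 8, 6, 10, 11, 12, 13, 14, 15};
    static int someGlobalArray[32] = {};
    someGlobalArray[9] = 1;              // arr[6] == 9, so element 6 of the gather will be 1

    __m512i output;                      // gathered results, in case the caller also wants them
    int pos = find_first_1(arr, someGlobalArray, &output);
    printf("first match at %d\n", pos);  // prints 6
    return 0;
}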

All 4 x86 compilers produce similar asm to what I suggested (see it on the Godbolt compiler explorer), but of course they have to actually materialize the set1_epi32(1) vector constant, or use a (broadcast) memory operand. Clang actually uses a {1to16} broadcast-load from a constant for the compare: vpcmpeqd k0, zmm1, dword ptr [rip + .LCPI0_0]{1to16}. (Of course they will make different choices when inlined into a loop.) Others use mov eax,1 / vpbroadcastd zmm0, eax.

gcc8.1 -O3 -march=skylake-avx512 has two redundant mov eax, -1 instructions: one to feed a kmov for the gather, the other for the return value. Silly compiler; it should just keep the -1 around and use a different register for the 1.

All of them use zmm0..15 and thus can't avoid a vzeroupper. (xmm16..31 are not accessible with legacy-SSE, so the SSE/AVX transition-penalty problem that vzeroupper solves doesn't exist if the only wide vector registers you use are y/zmm16..31.) There may still be tiny possible advantages to vzeroupper, like cheaper context switches when the upper halves of ymm or zmm regs are known to be zero (Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?). If you're going to use it anyway, there's no reason to avoid xmm0..15.

Oh, and in the Windows calling convention, xmm6..15 are call-preserved (only the low 128 bits, not the full ymm/zmm), so zmm16..31 are a good choice if you run out of xmm0..5 regs.

Peter Cordes
  • wow, thanks! I have no idea about 90% of the syntax or how to start but I'm keen to figure it out. – halivingston Jun 22 '18 at 04:44
  • The most fun part will be doing a benchmark later tonight! – halivingston Jun 22 '18 at 04:45
  • @halivingston: Would this be a more useful answer for you if I translated it to intrinsics like [`_mm512_cmpeq_epi32_mask( __m512i a, __m512i b)`](http://felixcloutier.com/x86/VPCMPD:VPCMPUD.html) instead of asm? You showed clang's asm output so I assumed you'd be comfortable translating asm to C++ intrinsics. (Probably a bad assumption though if you didn't know that x86 had a gather instruction, oops :P) I was basically answering the question of what a compiler *could* have done, since it's what you seemed to be asking. – Peter Cordes Jun 22 '18 at 04:55
  • I understand the concepts, but I will need to wade through the syntax. However, if you're able to update this answer with the intrinsics, I will appreciate the time savings! You've set me on the right course so I've accepted your answer regardless if you update it or not. – halivingston Jun 22 '18 at 05:03
  • @halivingston: I wrote up an intrinsics version, partly since I was curious how exactly it would compile with the 4 different mainstream x86 compilers. – Peter Cordes Jun 22 '18 at 05:51
  • Thanks a ton! I'm copy pasting into MSVC and will benchmark it. My intention is to keep it in a hot-loop over the same data so I'm not getting cache timing, etc. – halivingston Jun 22 '18 at 06:13
  • Another question I have is .. why didn't they use zmm16... so they didn't need to do a vzeroupper? – halivingston Jun 22 '18 at 06:14
  • @halivingston: xmm16..31 are not accessible with legacy-SSE, so [the SSE/AVX transition penalty problem that `vzeroupper` solves](https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake) doesn't exist for those registers. Probably compiler devs haven't thought of that optimization. I noticed it in an x265 hand-written asm comment or changelog. If you ever use xmm or ymm sub-registers, the shorter VEX encoding is nice, but you can't have that with xmm16..31. – Peter Cordes Jun 22 '18 at 06:23
  • I don't care about legacy SSE, I only will ever run this on Skylake. But I'm thinking tzcnt wants to fall into the sse mode? find_first_1 is likely to be called hundreds of time, so I guess this version that the compiler generates is better? Sorry if I'm not making any sense but I'm wondering if there is more perf to be had if I hand-write this code as you've written. – halivingston Jun 22 '18 at 06:31
  • @halivingston. `tzcnt` is a scalar integer instruction; it doesn't write an xmm reg while leaving the upper lanes unmodified; only legacy SSE instructions do that. You definitely want to let your compiler inline this function, so call overhead disappears and it doesn't have to `vzeroupper` + reload vector constants every time. The `vzeroupper` will sink out of the loop after inlining so it's not actually important, I was just rambling about it because we were talking about what a compiler would have to do for the non-inline version of this (which should never run in real life.) – Peter Cordes Jun 22 '18 at 06:33
  • @halivingston: *I don't care about legacy SSE, I only will ever run this on Skylake*. Did you recompile `libm` to only use VEX-coded instructions for `sinf`, `logf`, and so on? What about libc functions like `strlen`? If not then your program probably runs some non-VEX legacy-SSE encoded instructions. Fortunately compilers take care of this for you, and `vzeroupper` is cheap on Skylake (only 4 uops), so don't worry about it. – Peter Cordes Jun 22 '18 at 06:36
  • Thank you! I was checking out this post and it has some similar stuff we're doing here. https://blogs.msdn.microsoft.com/vcblog/2017/07/11/microsoft-visual-studio-2017-supports-intel-avx-512/ Vector instructions are fascinating, but I've realized in my field which is regular application development I can only use it in comparisons (like strings), arrays, and memcopies. Are there other logical constructs vectorization helps with? – halivingston Jun 22 '18 at 06:38
  • @halivingston: one of my favourite examples of stuff you wouldn't expect SIMD to be usable for is [Fastest way to get IPv4 address from string](https://stackoverflow.com/q/31679341), and [How to implement atoi using SIMD?](https://stackoverflow.com/a/35132718). packed-compare bitmap -> lookup table of shuffle masks is a neat trick. You can use SIMD to left-pack (i.e. filter) an array (AVX512 being especially good for that): [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) – Peter Cordes Jun 22 '18 at 06:44
  • @halivingston: you asked earlier about hand-writing asm: leave that to the compiler unless you can regularly look at compiler output and be confident you can do better, based on knowledge of microarchitectural details of the CPUs you care about. (See [Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture?](//stackoverflow.com/a/40355466) for an example of a novice writing asm that's slower than debug-mode `gcc -O0`, and an answer showing a case where you can beat `gcc -O3`'s code if you know what you're doing.) – Peter Cordes Jun 22 '18 at 06:47