
I implemented a 4x4 matrix inverse in SSE2 and AVX. Both are faster than the plain implementation, but if AVX is enabled (-mavx), the SSE2 implementation runs faster than my hand-written AVX implementation. It seems the compiler makes my SSE2 implementation more AVX-friendly :(

My AVX implementation has fewer multiplications and fewer additions, so I expected AVX to be faster than SSE. Maybe some instructions like _mm256_permute2f128_ps and _mm256_permutevar_ps/_mm256_permute_ps make AVX slower? I'm not loading SSE/XMM registers into AVX/YMM registers.

How can I make my AVX implementation faster than SSE?

My CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz (Ivy Bridge)

Plain with -O3      : 0.045853 secs
SSE2  with -O3      : 0.026021 secs
SSE2  with -O3 -mavx: 0.024336 secs
AVX1  with -O3 -mavx: 0.031798 secs

Updated (see bottom of question); all built with -O3 -mavx:
AVX1 (reduced div)  : 0.027666 secs
AVX1 (using rcp_ps) : 0.023205 secs
SSE2 (using rcp_ps) : 0.021969 secs

Initial Matrix:

Matrix (float4x4):
    |0.0714    -0.6589  0.7488  2.0000|
    |0.9446     0.2857  0.1613  4.0000|
    |-0.3202    0.6958  0.6429  6.0000|
    |0.0000     0.0000  0.0000  1.0000|

Test codes:

start = clock();

for (int i = 0; i < 1000000; i++) {
  glm_mat4_inv_sse2(m, m);

  // glm_mat4_inv_avx(m, m);
  // glm_mat4_inv(m, m)
}

end   = clock();
total = (float)(end - start) / CLOCKS_PER_SEC;
printf("%f secs\n\n", total);

Implementations:

Library: http://github.com/recp/cglm

SSE Impl: https://gist.github.com/recp/690025c955c2e69a91e3a60a13768dee

AVX Impl: https://gist.github.com/recp/8ccc5ad0d19f5516de55f9bf7b5045b2

SSE2 implementation output (using godbolt; options -O3):

glm_mat4_inv_sse2:
        movaps  xmm8, XMMWORD PTR [rdi+32]
        movaps  xmm2, XMMWORD PTR [rdi+16]
        movaps  xmm5, XMMWORD PTR [rdi+48]
        movaps  xmm6, XMMWORD PTR [rdi]
        movaps  xmm4, xmm8
        movaps  xmm13, xmm8
        movaps  xmm11, xmm8
        shufps  xmm11, xmm2, 170
        shufps  xmm4, xmm5, 238
        movaps  xmm3, xmm11
        movaps  xmm1, xmm8
        pshufd  xmm12, xmm4, 127
        shufps  xmm13, xmm2, 255
        movaps  xmm0, xmm13
        movaps  xmm9, xmm8
        pshufd  xmm4, xmm4, 42
        shufps  xmm9, xmm2, 85
        shufps  xmm1, xmm5, 153
        movaps  xmm7, xmm9
        mulps   xmm0, xmm4
        pshufd  xmm10, xmm1, 42
        movaps  xmm1, xmm11
        shufps  xmm5, xmm8, 0
        mulps   xmm3, xmm12
        pshufd  xmm5, xmm5, 128
        mulps   xmm7, xmm12
        mulps   xmm1, xmm10
        subps   xmm3, xmm0
        movaps  xmm0, xmm13
        mulps   xmm0, xmm10
        mulps   xmm13, xmm5
        subps   xmm7, xmm0
        movaps  xmm0, xmm9
        mulps   xmm0, xmm4
        subps   xmm0, xmm1
        movaps  xmm1, xmm8
        movaps  xmm8, xmm11
        shufps  xmm1, xmm2, 0
        mulps   xmm8, xmm5
        movaps  xmm11, xmm7
        mulps   xmm4, xmm1
        mulps   xmm5, xmm9
        movaps  xmm9, xmm2
        mulps   xmm12, xmm1
        shufps  xmm9, xmm6, 85
        pshufd  xmm9, xmm9, 168
        mulps   xmm1, xmm10
        movaps  xmm10, xmm2
        shufps  xmm10, xmm6, 0
        pshufd  xmm10, xmm10, 168
        subps   xmm4, xmm8
        mulps   xmm7, xmm10
        movaps  xmm8, xmm2
        shufps  xmm2, xmm6, 255
        shufps  xmm8, xmm6, 170
        pshufd  xmm8, xmm8, 168
        pshufd  xmm2, xmm2, 168
        mulps   xmm11, xmm8
        subps   xmm12, xmm13
        movaps  xmm13, XMMWORD PTR .LC0[rip]
        subps   xmm1, xmm5
        movaps  xmm5, xmm3
        mulps   xmm5, xmm9
        mulps   xmm3, xmm10
        subps   xmm5, xmm11
        movaps  xmm11, xmm0
        mulps   xmm11, xmm2
        mulps   xmm0, xmm10
        addps   xmm5, xmm11
        movaps  xmm11, xmm12
        mulps   xmm11, xmm8
        mulps   xmm12, xmm9
        xorps   xmm5, xmm13
        subps   xmm3, xmm11
        movaps  xmm11, xmm4
        mulps   xmm4, xmm9
        subps   xmm7, xmm12
        mulps   xmm11, xmm2
        mulps   xmm2, xmm1
        mulps   xmm1, xmm8
        subps   xmm0, xmm4
        addps   xmm3, xmm11
        movaps  xmm11, XMMWORD PTR .LC1[rip]
        addps   xmm2, xmm7
        addps   xmm0, xmm1
        movaps  xmm1, xmm5
        xorps   xmm3, xmm11
        xorps   xmm2, xmm13
        shufps  xmm1, xmm3, 0
        xorps   xmm0, xmm11
        movaps  xmm4, xmm2
        shufps  xmm4, xmm0, 0
        shufps  xmm1, xmm4, 136
        mulps   xmm1, xmm6
        pshufd  xmm4, xmm1, 27
        addps   xmm1, xmm4
        pshufd  xmm4, xmm1, 65
        addps   xmm1, xmm4
        movaps  xmm4, XMMWORD PTR .LC2[rip]
        divps   xmm4, xmm1
        mulps   xmm5, xmm4
        mulps   xmm3, xmm4
        mulps   xmm2, xmm4
        mulps   xmm0, xmm4
        movaps  XMMWORD PTR [rsi], xmm5
        movaps  XMMWORD PTR [rsi+16], xmm3
        movaps  XMMWORD PTR [rsi+32], xmm2
        movaps  XMMWORD PTR [rsi+48], xmm0
        ret
.LC0:
        .long   0
        .long   2147483648
        .long   0
        .long   2147483648
.LC1:
        .long   2147483648
        .long   0
        .long   2147483648
        .long   0
.LC2:
        .long   1065353216
        .long   1065353216
        .long   1065353216
        .long   1065353216

SSE2 implementation (AVX enabled) output (using godbolt; options -O3 -mavx):

glm_mat4_inv_sse2:
        vmovaps xmm9, XMMWORD PTR [rdi+32]
        vmovaps xmm6, XMMWORD PTR [rdi+48]
        vmovaps xmm2, XMMWORD PTR [rdi+16]
        vmovaps xmm7, XMMWORD PTR [rdi]
        vshufps xmm5, xmm9, xmm6, 238
        vpshufd xmm13, xmm5, 127
        vpshufd xmm5, xmm5, 42
        vshufps xmm1, xmm9, xmm6, 153
        vshufps xmm11, xmm9, xmm2, 170
        vshufps xmm12, xmm9, xmm2, 255
        vmulps  xmm3, xmm11, xmm13
        vpshufd xmm1, xmm1, 42
        vmulps  xmm0, xmm12, xmm5
        vshufps xmm10, xmm9, xmm2, 85
        vshufps xmm6, xmm6, xmm9, 0
        vpshufd xmm6, xmm6, 128
        vmulps  xmm8, xmm10, xmm13
        vmulps  xmm4, xmm10, xmm5
        vsubps  xmm3, xmm3, xmm0
        vmulps  xmm0, xmm12, xmm1
        vsubps  xmm8, xmm8, xmm0
        vmulps  xmm0, xmm11, xmm1
        vsubps  xmm4, xmm4, xmm0
        vshufps xmm0, xmm9, xmm2, 0
        vmulps  xmm9, xmm12, xmm6
        vmulps  xmm13, xmm0, xmm13
        vmulps  xmm5, xmm0, xmm5
        vmulps  xmm0, xmm0, xmm1
        vsubps  xmm12, xmm13, xmm9
        vmulps  xmm9, xmm11, xmm6
        vmovaps xmm13, XMMWORD PTR .LC0[rip]
        vmulps  xmm6, xmm10, xmm6
        vshufps xmm10, xmm2, xmm7, 85
        vpshufd xmm10, xmm10, 168
        vsubps  xmm5, xmm5, xmm9
        vshufps xmm9, xmm2, xmm7, 170
        vpshufd xmm9, xmm9, 168
        vsubps  xmm1, xmm0, xmm6
        vmulps  xmm11, xmm8, xmm9
        vshufps xmm0, xmm2, xmm7, 0
        vshufps xmm2, xmm2, xmm7, 255
        vmulps  xmm6, xmm3, xmm10
        vpshufd xmm2, xmm2, 168
        vpshufd xmm0, xmm0, 168
        vmulps  xmm3, xmm3, xmm0
        vmulps  xmm8, xmm8, xmm0
        vmulps  xmm0, xmm4, xmm0
        vsubps  xmm6, xmm6, xmm11
        vmulps  xmm11, xmm4, xmm2
        vaddps  xmm6, xmm6, xmm11
        vmulps  xmm11, xmm12, xmm9
        vmulps  xmm12, xmm12, xmm10
        vxorps  xmm6, xmm6, xmm13
        vsubps  xmm3, xmm3, xmm11
        vmulps  xmm11, xmm5, xmm2
        vmulps  xmm5, xmm5, xmm10
        vsubps  xmm8, xmm8, xmm12
        vmulps  xmm2, xmm1, xmm2
        vmulps  xmm1, xmm1, xmm9
        vaddps  xmm3, xmm3, xmm11
        vmovaps xmm11, XMMWORD PTR .LC1[rip]
        vsubps  xmm0, xmm0, xmm5
        vaddps  xmm2, xmm8, xmm2
        vxorps  xmm3, xmm3, xmm11
        vaddps  xmm0, xmm0, xmm1
        vshufps xmm1, xmm6, xmm3, 0
        vxorps  xmm2, xmm2, xmm13
        vxorps  xmm0, xmm0, xmm11
        vshufps xmm4, xmm2, xmm0, 0
        vshufps xmm1, xmm1, xmm4, 136
        vmulps  xmm1, xmm1, xmm7
        vpshufd xmm4, xmm1, 27
        vaddps  xmm1, xmm1, xmm4
        vpshufd xmm4, xmm1, 65
        vaddps  xmm1, xmm1, xmm4
        vmovaps xmm4, XMMWORD PTR .LC2[rip]
        vdivps  xmm1, xmm4, xmm1
        vmulps  xmm6, xmm6, xmm1
        vmulps  xmm3, xmm3, xmm1
        vmulps  xmm2, xmm2, xmm1
        vmulps  xmm1, xmm0, xmm1
        vmovaps XMMWORD PTR [rsi], xmm6
        vmovaps XMMWORD PTR [rsi+16], xmm3
        vmovaps XMMWORD PTR [rsi+32], xmm2
        vmovaps XMMWORD PTR [rsi+48], xmm1
        ret
.LC0:
        .long   0
        .long   2147483648
        .long   0
        .long   2147483648
.LC1:
        .long   2147483648
        .long   0
        .long   2147483648
        .long   0
.LC2:
        .long   1065353216
        .long   1065353216
        .long   1065353216
        .long   1065353216

AVX implementation output (using godbolt; options -O3 -mavx):

glm_mat4_inv_avx:
        vmovaps ymm3, YMMWORD PTR [rdi]
        vmovaps ymm1, YMMWORD PTR [rdi+32]
        vmovdqa ymm2, YMMWORD PTR .LC1[rip]
        vmovdqa ymm0, YMMWORD PTR .LC0[rip]
        vperm2f128      ymm6, ymm3, ymm3, 3
        vperm2f128      ymm5, ymm1, ymm1, 0
        vperm2f128      ymm1, ymm1, ymm1, 17
        vmovdqa ymm10, YMMWORD PTR .LC4[rip]
        vpermilps       ymm9, ymm5, ymm0
        vpermilps       ymm7, ymm1, ymm2
        vperm2f128      ymm8, ymm6, ymm6, 0
        vpermilps       ymm1, ymm1, ymm0
        vpermilps       ymm5, ymm5, ymm2
        vpermilps       ymm0, ymm8, ymm0
        vmulps  ymm4, ymm7, ymm9
        vpermilps       ymm8, ymm8, ymm2
        vpermilps       ymm11, ymm6, 1
        vmulps  ymm2, ymm5, ymm1
        vmulps  ymm7, ymm0, ymm7
        vmulps  ymm1, ymm8, ymm1
        vmulps  ymm0, ymm0, ymm5
        vmulps  ymm5, ymm8, ymm9
        vmovdqa ymm9, YMMWORD PTR .LC3[rip]
        vmovdqa ymm8, YMMWORD PTR .LC2[rip]
        vsubps  ymm4, ymm4, ymm2
        vsubps  ymm7, ymm7, ymm1
        vperm2f128      ymm2, ymm4, ymm4, 0
        vperm2f128      ymm4, ymm4, ymm4, 17
        vshufps ymm1, ymm2, ymm4, 77
        vpermilps       ymm1, ymm1, ymm9
        vsubps  ymm5, ymm0, ymm5
        vpermilps       ymm0, ymm2, ymm8
        vmulps  ymm0, ymm0, ymm11
        vperm2f128      ymm1, ymm1, ymm2, 0
        vshufps ymm2, ymm2, ymm4, 74
        vpermilps       ymm4, ymm6, 90
        vmulps  ymm1, ymm1, ymm4
        vpermilps       ymm2, ymm2, ymm10
        vpermilps       ymm6, ymm6, 191
        vmovaps ymm11, YMMWORD PTR .LC5[rip]
        vperm2f128      ymm2, ymm2, ymm2, 0
        vperm2f128      ymm4, ymm3, ymm3, 0
        vpermilps       ymm12, ymm4, YMMWORD PTR .LC7[rip]
        vmulps  ymm2, ymm2, ymm6
        vinsertf128     ymm6, ymm7, xmm5, 1
        vperm2f128      ymm5, ymm7, ymm5, 49
        vshufps ymm7, ymm6, ymm5, 77
        vpermilps       ymm9, ymm7, ymm9
        vsubps  ymm0, ymm0, ymm1
        vpermilps       ymm1, ymm4, YMMWORD PTR .LC6[rip]
        vpermilps       ymm4, ymm4, YMMWORD PTR .LC8[rip]
        vaddps  ymm2, ymm0, ymm2
        vpermilps       ymm0, ymm6, ymm8
        vshufps ymm6, ymm6, ymm5, 74
        vpermilps       ymm6, ymm6, ymm10
        vmulps  ymm1, ymm1, ymm0
        vmulps  ymm0, ymm12, ymm9
        vmulps  ymm6, ymm4, ymm6
        vxorps  ymm2, ymm2, ymm11
        vdpps   ymm3, ymm3, ymm2, 255
        vsubps  ymm0, ymm1, ymm0
        vdivps  ymm2, ymm2, ymm3
        vaddps  ymm0, ymm0, ymm6
        vxorps  ymm0, ymm0, ymm11
        vdivps  ymm0, ymm0, ymm3
        vperm2f128      ymm5, ymm2, ymm2, 3
        vshufps ymm1, ymm2, ymm5, 68
        vshufps ymm2, ymm2, ymm5, 238
        vperm2f128      ymm4, ymm0, ymm0, 3
        vshufps ymm6, ymm0, ymm4, 68
        vshufps ymm0, ymm0, ymm4, 238
        vshufps ymm3, ymm1, ymm6, 136
        vshufps ymm1, ymm1, ymm6, 221
        vinsertf128     ymm1, ymm3, xmm1, 1
        vshufps ymm3, ymm2, ymm0, 136
        vshufps ymm0, ymm2, ymm0, 221
        vinsertf128     ymm0, ymm3, xmm0, 1
        vmovaps YMMWORD PTR [rsi], ymm1
        vmovaps YMMWORD PTR [rsi+32], ymm0
        vzeroupper
        ret
.LC0:
        .long   2
        .long   1
        .long   1
        .long   0
        .long   0
        .long   0
        .long   0
        .long   0
.LC1:
        .long   3
        .long   3
        .long   2
        .long   3
        .long   2
        .long   1
        .long   1
        .long   1
.LC2:
        .long   0
        .long   0
        .long   1
        .long   2
        .long   0
        .long   0
        .long   1
        .long   2
.LC3:
        .long   0
        .long   1
        .long   1
        .long   2
        .long   0
        .long   1
        .long   1
        .long   2
.LC4:
        .long   0
        .long   2
        .long   3
        .long   3
        .long   0
        .long   2
        .long   3
        .long   3
.LC5:
        .long   0
        .long   2147483648
        .long   0
        .long   2147483648
        .long   2147483648
        .long   0
        .long   2147483648
        .long   0
.LC6:
        .long   1
        .long   0
        .long   0
        .long   0
        .long   1
        .long   0
        .long   0
        .long   0
.LC7:
        .long   2
        .long   2
        .long   1
        .long   1
        .long   2
        .long   2
        .long   1
        .long   1
.LC8:
        .long   3
        .long   3
        .long   3
        .long   2
        .long   3
        .long   3
        .long   3
        .long   2

EDIT:

I'm using Xcode (Version 10.0 (10A255)) on macOS (on a MacBook Pro (Retina, Mid 2012, 15")) to build and run the tests with the -O3 optimization option; it compiles the test code with clang. I used GCC 8.2 on godbolt to view the asm (sorry for that), but the assembly output looks similar.

pshufd was enabled via the cglm option CGLM_USE_INT_DOMAIN; I forgot to disable it when viewing the asm:

#ifdef CGLM_USE_INT_DOMAIN
#  define glmm_shuff1(xmm, z, y, x, w)                                        \
     _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(xmm),                \
                                        _MM_SHUFFLE(z, y, x, w)))
#else
#  define glmm_shuff1(xmm, z, y, x, w)                                        \
     _mm_shuffle_ps(xmm, xmm, _MM_SHUFFLE(z, y, x, w))
#endif

Whole test code (except headers):

#include <cglm/cglm.h>

#include <sys/time.h>
#include <time.h>

int
main(int argc, const char * argv[]) {
  CGLM_ALIGN(32) mat4 m = GLM_MAT4_IDENTITY_INIT;

  double start, end, total;

  /* generate invertible matrix */
  glm_translate(m, (vec3){1,2,3});
  glm_rotate(m, M_PI_2, (vec3){1,2,3});
  glm_translate(m, (vec3){1,2,3});

  glm_mat4_print(m, stderr);

  start = clock();

  for (int i = 0; i < 1000000; i++) {
    glm_mat4_inv_sse2(m, m);

    // glm_mat4_inv_avx(m, m);
    // glm_mat4_inv(m, m);
  }

  end   = clock();
  total = (float)(end - start) / CLOCKS_PER_SEC;

  printf("%f secs\n\n", total);

  glm_mat4_print(m, stderr);
}

EDIT 2:

I have eliminated one division by using multiplication (1 set_ps + 1 div_ps + 2 mul_ps seems better than 2 div_ps):

Old Version:

r1 = _mm256_div_ps(r1, y4);
r2 = _mm256_div_ps(r2, y4);

New Version (the SSE2 version already did the division this way):

y5 = _mm256_div_ps(_mm256_set1_ps(1.0f), y4);
r1 = _mm256_mul_ps(r1, y5);
r2 = _mm256_mul_ps(r2, y5);

New Version (Fast version):

y5 = _mm256_rcp_ps(y4);
r1 = _mm256_mul_ps(r1, y5);
r2 = _mm256_mul_ps(r2, y5);
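
If more precision were needed than the raw rcpps estimate (~12 bits), a single Newton-Raphson step could refine it. This is just a sketch of the standard recurrence x1 = x0 * (2 - d * x0), reusing the y4/r1/r2 names from above; it is not part of my implementation:

y5 = _mm256_rcp_ps(y4);                          /* rough estimate of 1/y4  */
y5 = _mm256_mul_ps(y5,
       _mm256_sub_ps(_mm256_set1_ps(2.0f),
                     _mm256_mul_ps(y4, y5)));    /* y5 = y5 * (2 - y4 * y5) */
r1 = _mm256_mul_ps(r1, y5);
r2 = _mm256_mul_ps(r2, y5);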

Now it is better than before, but still not faster than SSE on my Ivy Bridge CPU. I have updated the test results above.

  • rcpps is very low precision, like only 12 bits. If that's enough, then great. But often it's used with a Newton-Raphson iteration to nearly double that. If divider execution unit throughput is a bottleneck, then that can be a win, otherwise `divps` is a single uop with high latency, but isn't a throughput problem unless the not-fully-pipelined FP divider is already busy. (It's only 128 bits wide, though, so 256-bit division has half throughput, unlike other ALU stuff. [Floating point division vs floating point multiplication](https://stackoverflow.com/a/45899202) – Peter Cordes Oct 30 '18 at 06:33
  • `If that's enough, then great` — it is optional. `glm_mat4_inv()` uses 1/div + mul, and `glm_mat4_inv_fast()` uses `rcpps` to increase speed with less accuracy/precision; it is the user's choice. I think I should understand "throughput vs latency", so I added it to my TODOs. Actually I tried to understand it, but it seems that was not enough :( – recp Oct 30 '18 at 06:42
  • Yes, calculating 1/x once, and then multiplying by that twice is good, whether you use `divps` or `rcpps`. Re: throughput vs. latency: related [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391). Also related, the badly titled [How does a single thread run on multiple cores?](https://softwareengineering.stackexchange.com/a/350024), and also [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?](https://stackoverflow.com/q/45113527) – Peter Cordes Oct 30 '18 at 06:55
  • @PeterCordes thank you very much for your valuable comments and links/resources. I'll read all later. – recp Oct 30 '18 at 07:11

1 Answer


Your CPU is an Intel IvyBridge.

Sandybridge / IvyBridge has 1-per-clock mul and add throughput, on different ports so they don't compete with each other.

But it has only 1-per-clock shuffle throughput for 256-bit shuffles and for all FP shuffles (even 128-bit shufps). However, it has 2-per-clock throughput for integer shuffles, and I notice your compiler is using pshufd as a copy-and-shuffle between FP instructions. This is a solid win when compiling for SSE2, especially where the VEX encoding isn't available (it saves a movaps by replacing movaps xmm0, xmm1 / shufps xmm0, xmm0, 65 or whatever). Your compiler is doing this even when AVX is available, so it could have used vshufps xmm0, xmm1, xmm1, 65; either it's cleverly choosing vpshufd for microarchitectural reasons, or it got lucky, or its heuristics / instruction cost model were designed with this in mind. (I suspect it was clang, but you didn't say in the question or show the C source you compiled from.)

In Haswell and later (which supports AVX2 and thus 256-bit versions of every integer shuffle), all shuffles can only run on port 5. But in IvB where only AVX1 is supported, it's only FP shuffles that go up to 256 bits. Integer shuffles are always only 128 bits, and can run on port 1 or port 5, because there are 128-bit shuffle execution units on both those ports. (https://agner.org/optimize/)


I haven't looked at the asm in a ton of detail because it's long, but if it costs you more shuffles to save on adds / multiplies by using wider vectors, that would be slower.

On top of that, all your shuffles become FP shuffles, so they can only run on port 5 and don't take advantage of port 1. I suspect there's so much shuffling that it's the bottleneck, rather than port 0 (FP multiply) or port 1 (FP add).

BTW, Haswell and later have two FMA units, one each on p0 and p1, so multiply has twice the throughput. Skylake and later run FP add on those FMA units as well, so both have 2-per-clock throughput. (And if you can usefully use actual FMA instructions, you can get twice the work done.)
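
For example, the a*b - c*d sub-determinant terms that a 4x4 inverse is built from (the mulps/mulps/subps groups in your asm) could each fold the subtract into one of the multiplies with FMA3. A hypothetical sketch, not something your IvB can run:

#include <immintrin.h>

/* without FMA: two multiplies + one subtract */
static inline __m256 cofactor(__m256 a, __m256 b, __m256 c, __m256 d) {
  return _mm256_sub_ps(_mm256_mul_ps(a, b), _mm256_mul_ps(c, d));
}

/* with FMA3 (-mfma): the subtract fuses into one of the multiplies */
static inline __m256 cofactor_fma(__m256 a, __m256 b, __m256 c, __m256 d) {
  return _mm256_fmsub_ps(a, b, _mm256_mul_ps(c, d));  /* a*b - (c*d) */
}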

Also, your benchmark is testing latency, not throughput, because the same m is the input and output. There might be enough instruction-level parallelism to just bottleneck on shuffle throughput, though.
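
A throughput-oriented test would invert several independent matrices per iteration, so one call's result isn't the next call's input. A rough sketch built on your own test code; the array size and indexing are just for illustration:

CGLM_ALIGN(32) mat4 src[8], dst[8];

/* ... fill src[0..7] with invertible matrices, as in your current setup ... */

start = clock();
for (int i = 0; i < 1000000; i++) {
  /* independent input and output each iteration: no loop-carried dependency,
     so this measures throughput rather than back-to-back latency */
  glm_mat4_inv_sse2(src[i & 7], dst[i & 7]);
}
end = clock();
/* print or otherwise use dst afterwards so the work can't be optimized away */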

Lane-crossing shuffles like vperm2f128 and vinsertf128 have 2 cycle latency on IvB, vs. in-lane shuffles (including all 128-bit shuffles) having only single cycle latency. Intel's guides claim a different number, IIRC, but 2 cycles is what Agner Fog's actual measurements found in practice in a dependency chain. (This is probably 1 cycle + some kind of bypass delay). On Haswell and later, lane-crossing shuffles are 3 cycle latency. Why are some Haswell AVX latencies advertised by Intel as 3x slower than Sandy Bridge?

Also related: Do 128bit cross lane operations in AVX512 give better performance? You can sometimes reduce the amount of shuffling with an unaligned load that cuts into 128-bit halves at a useful point, and then use in-lane shuffles. That's potentially useful for AVX1, because it lacks vpermps or other lane-crossing shuffles with granularity less than 128 bits.
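
For example (an untested sketch, assuming cglm's mat4 is 16 contiguous, 32-byte-aligned floats in row-major order), an unaligned 256-bit load starting at row 1 already gives you the {row1, row2} pairing in one register, where the AVX version would otherwise spend a vperm2f128 to build it:

/* m points to the 16 floats of the matrix, rows 0..3 */
__m256 r0r1 = _mm256_load_ps(m);        /* rows 0 and 1 (aligned)   */
__m256 r1r2 = _mm256_loadu_ps(m + 4);   /* rows 1 and 2 (unaligned) */
__m256 r2r3 = _mm256_load_ps(m + 8);    /* rows 2 and 3 (aligned)   */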

Peter Cordes
  • Great answer! I added more detail to the bottom of the question; I used clang for the tests but GCC to see the asm, sorry for that. – recp Oct 29 '18 at 12:34
  • Is it possible to use `_mm256_permutevar_ps` / `_mm256_shuffle_ps` on another port (e.g. the integer domain) to make it faster (I'll check the Intel Intrinsics Guide)? What do you suggest: should I keep the AVX version, or should I drop it and use the SSE version going forward? – recp Oct 29 '18 at 12:36
  • @recp: no, there aren't any Intel CPUs with better than 1 per clock throughput for any 256-bit shuffles. Fast 128-bit shuffles is a feature of Nehalem through IvB. It costs a lot of transistors to build wide shuffle units, and Intel cut back on shuffle hardware for Haswell. – Peter Cordes Oct 29 '18 at 17:52
  • @recp: you should benchmark on a Haswell and/or Skylake, if possible. I haven't tried to do static analysis on your code by hand or with IACA ([What is IACA and how do I use it?](/q/26021337)) to see which one has more total shuffles, but if they're about the same speed on Haswell then use the one that's fastest on IvB. Or an AVX2 + FMA version for Haswell might give you a speedup with lane-crossing 256-bit shuffles, but make sure you try adding FMA separately from AVX2 in case FMA + 128-bit shuffles is the best. That's probably really good on Ryzen. – Peter Cordes Oct 29 '18 at 17:55
  • I don't think I can find a Haswell or Skylake for now :( I'll try to check IACA later. My CPU (IvB) doesn't support AVX2 or FMA :( That's why I only implemented AVX1 for now. After upgrading my MacBook I'll implement the newer CPU features. – recp Oct 29 '18 at 18:04
  • `Fast 128-bit shuffles is a feature of Nehalem through IvB. It costs a lot of transistors to build wide shuffle units, and Intel cut back on shuffle hardware for Haswell.` This is an interesting and valuable detail. Maybe AVX version could give better results on IvB+ (Haswell, Skylake...) – recp Oct 29 '18 at 18:07