
I am trying to convert a loop I have into SSE intrinsics. I seem to have made fairly good progress, and by that I mean it's headed in the correct direction; however, I appear to have done some of the translation wrong somewhere, as I am not getting the same "correct" answer that the non-SSE code produces.

My original loop which I unrolled by a factor of 4 looks like this:

int unroll_n = (N/4)*4;

for (int j = 0; j < unroll_n; j++) {
    for (int i = 0; i < unroll_n; i+=4) {
        float rx = x[j] - x[i];
        float ry = y[j] - y[i];
        float rz = z[j] - z[i];
        float r2 = rx*rx + ry*ry + rz*rz + eps;
        float r2inv = 1.0f / sqrt(r2);
        float r6inv = r2inv * r2inv * r2inv;
        float s = m[j] * r6inv;
        ax[i] += s * rx;
        ay[i] += s * ry;
        az[i] += s * rz;
        //unroll i 2
        rx = x[j] - x[i+1];
        ry = y[j] - y[i+1];
        rz = z[j] - z[i+1];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+1] += s * rx;
        ay[i+1] += s * ry;
        az[i+1] += s * rz;
        //unroll i 3
        rx = x[j] - x[i+2];
        ry = y[j] - y[i+2];
        rz = z[j] - z[i+2];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+2] += s * rx;
        ay[i+2] += s * ry;
        az[i+2] += s * rz;
        //unroll i 4
        rx = x[j] - x[i+3];
        ry = y[j] - y[i+3];
        rz = z[j] - z[i+3];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+3] += s * rx;
        ay[i+3] += s * ry;
        az[i+3] += s * rz;
    }
}

I essentially then went line by line through the top section and converted it into SSE intrinsics. The code is below. I'm not totally sure if the top three lines are needed; however, I understand that my data needs to be 16-byte aligned for this to work correctly and optimally.

float *x = malloc(sizeof(float) * N);
float *y = malloc(sizeof(float) * N);
float *z = malloc(sizeof(float) * N); 

for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i+=4) {
        __m128 xj_v = _mm_set1_ps(x[j]);
        __m128 xi_v = _mm_load_ps(&x[i]);
        __m128 rx_v = _mm_sub_ps(xj_v, xi_v);

        __m128 yj_v = _mm_set1_ps(y[j]);
        __m128 yi_v = _mm_load_ps(&y[i]);
        __m128 ry_v = _mm_sub_ps(yj_v, yi_v);

        __m128 zj_v = _mm_set1_ps(z[j]);
        __m128 zi_v = _mm_load_ps(&z[i]);
        __m128 rz_v = _mm_sub_ps(zj_v, zi_v);

        __m128 r2_v = _mm_mul_ps(rx_v, rx_v) + _mm_mul_ps(ry_v, ry_v) + _mm_mul_ps(rz_v, rz_v) + _mm_set1_ps(eps);

        __m128 r2inv_v = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(r2_v));

        __m128 r6inv_1v = _mm_mul_ps(r2inv_v, r2inv_v);
        __m128 r6inv_v = _mm_mul_ps(r6inv_1v, r2inv_v);

        __m128 mj_v = _mm_set1_ps(m[j]);
        __m128 s_v = _mm_mul_ps(mj_v, r6inv_v);

        __m128 axi_v = _mm_load_ps(&ax[i]);
        __m128 ayi_v = _mm_load_ps(&ay[i]);
        __m128 azi_v = _mm_load_ps(&az[i]);

        __m128 srx_v = _mm_mul_ps(s_v, rx_v);
        __m128 sry_v = _mm_mul_ps(s_v, ry_v);
        __m128 srz_v = _mm_mul_ps(s_v, rz_v);

        axi_v = _mm_add_ps(axi_v, srx_v);
        ayi_v = _mm_add_ps(ayi_v, srx_v);
        azi_v = _mm_add_ps(azi_v, srx_v);

        _mm_store_ps(ax, axi_v);
        _mm_store_ps(ay, ayi_v);
        _mm_store_ps(az, azi_v);
    }
}

I feel the main idea is correct; however, there is an error (or several) somewhere, as the resulting answer is incorrect.
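
Regarding the alignment: from what I've read, plain malloc isn't guaranteed to return 16-byte-aligned pointers, so I'm guessing I would need to allocate with something like _mm_malloc instead (not sure if this is the right approach):

float *x = _mm_malloc(sizeof(float) * N, 16);   // 16-byte aligned, safe for _mm_load_ps
float *y = _mm_malloc(sizeof(float) * N, 16);
float *z = _mm_malloc(sizeof(float) * N, 16);
// ... and the same for ax, ay, az; freed later with _mm_free() instead of free()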

  • It'd be really useful if you posted compilable code, rather than a fragment. – EOF Mar 01 '16 at 23:19
  • There is also an instruction which calculates the reciprocal square root: `_mm_rsqrt_ps` – fsasm Mar 01 '16 at 23:21
  • @fsasm: That's only an approximation. It's not suitable for precise calculations, unless refined by a few steps of newton-raphson. – EOF Mar 01 '16 at 23:25
  • @EOF Right, I forgot to mention it. The accuracy of rsqrt is good enough for some calculations in games and for calculations that refine the result with an iterative method (like Newton-Raphson), provided the developer knows what is happening. Nonetheless, it can greatly speed up the calculation. – fsasm Mar 01 '16 at 23:39
  • Note that using the `+` operator on vectors, instead of `_mm_add_ps`, is a [GNU extension](https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). You're doing this on the line that sums the squares of the three components. Also, SSE loads/stores have to be 16-**byte** aligned, unless you use `loadu`/`storeu` intrinsics. (Can be a tiny speed penalty even when the data does turn out to be aligned at runtime, since it prevents folding loads into ALU ops.) – Peter Cordes Mar 02 '16 at 01:06
  • @EOF: rsqrtps + one Newton-Raphson iteration gives a precision of [+/-2ulp for single precision](http://stackoverflow.com/a/32002316/224132). The hardware sqrt instruction is a better choice for double precision. I'm not sure it's a win for this case. – Peter Cordes Mar 02 '16 at 04:25
  • Have you tried auto-vectorization? [GCC will use `rsqrtps` and one iteration of Newton-Raphson with `-Ofast`](https://stackoverflow.com/questions/34585770/gcc-4-8-avx-optimization-bug-extra-code-insertion/34586817#34586817). – Z boson Mar 02 '16 at 08:13

1 Answer


I think your only bugs are simple typos, not logic errors; see below.

Can't you just use clang's auto-vectorization? Or do you need to use gcc for this code? Auto-vectorization would let you make SSE, AVX, and (in future) AVX512 versions from the same source with no modifications. Intrinsics aren't scalable to different vector sizes, unfortunately.

Based on your start at vectorizing, I made an optimized version. You should try it out; I'm curious to hear whether it's faster than your version with bugfixes, or than clang's auto-vectorized version. :)


This looks wrong:

_mm_store_ps(ax, axi_v);
_mm_store_ps(ay, ayi_v);
_mm_store_ps(az, azi_v);

You loaded from ax[i], but now you're storing to ax[0].
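
The fix is just to store back to the same elements you loaded from (which is what the full version at the bottom does):

_mm_store_ps(&ax[i], axi_v);
_mm_store_ps(&ay[i], ayi_v);
_mm_store_ps(&az[i], azi_v);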


Also, clang's unused-variable warning found this bug:

axi_v = _mm_add_ps(axi_v, srx_v);
ayi_v = _mm_add_ps(ayi_v, srx_v);  // should be sry_v
azi_v = _mm_add_ps(azi_v, srx_v);  // should be srz_v
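
i.e. the corrected accumulation should be:

axi_v = _mm_add_ps(axi_v, srx_v);
ayi_v = _mm_add_ps(ayi_v, sry_v);
azi_v = _mm_add_ps(azi_v, srz_v);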

Like I said in my answer on your previous question, you should probably interchange the loops, so the same ax[i+0..3], ay[i+0..3], and az[i+0..3] can stay in vector accumulators for the whole inner loop, avoiding that load/store on every iteration.

Also, if you're not going to use rsqrtps + Newton-Raphson, you should use the transformation I pointed out in my last answer: divide m[j] by sqrt(r2)^3. There's no point spending a divps on 1.0f / sqrt(r2) and then multiplying by m[j] separately, when a single divide does the job.
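
In scalar terms the transformation is just this (a sketch, using sqrt(r2)^3 == r2*sqrt(r2)):

float r2 = rx*rx + ry*ry + rz*rz + eps;
float s  = m[j] / (r2 * sqrtf(r2));    // == m[j] * (1/sqrt(r2))^3, one divide, no separate 1.0f/x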

rsqrt might not actually be a win, because total uop throughput might be more of a bottleneck than div / sqrt throughput or latency. Three multiplies + FMA + rsqrtps is significantly more uops than sqrtps + divps. rsqrt is more helpful with AVX 256b vectors, because the divide / sqrt unit isn't full-width on SnB to Broadwell. Skylake has 12c latency for sqrtps ymm, same as for xmm, but throughput is still better for xmm (one per 3c instead of one per 6c).

clang and gcc were both using rsqrtps / rsqrtss when compiling your code with -ffast-math (only clang using the packed version, of course).


If you don't interchange the loops, you should manually hoist everything that only depends on j out of the inner loop. Compilers are generally good at this, but it still seems like a good idea to make the source reflect what you expect the compiler to be able to do. This helps with "seeing" what the loop is actually doing.
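
For example, with your original loop order, the hoisting looks something like this (a sketch using your variable names; only the broadcasts move, the rest of the i-loop body stays the same):

for (int j = 0; j < N; j++) {
    // everything that depends only on j, loaded/broadcast once per outer iteration
    __m128 xj_v = _mm_set1_ps(x[j]);
    __m128 yj_v = _mm_set1_ps(y[j]);
    __m128 zj_v = _mm_set1_ps(z[j]);
    __m128 mj_v = _mm_set1_ps(m[j]);
    for (int i = 0; i < N; i += 4) {
        __m128 rx_v = _mm_sub_ps(xj_v, _mm_load_ps(&x[i]));
        __m128 ry_v = _mm_sub_ps(yj_v, _mm_load_ps(&y[i]));
        __m128 rz_v = _mm_sub_ps(zj_v, _mm_load_ps(&z[i]));
        // ... rest of the i-loop body unchanged ...
    }
}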


Here's a version with some optimizations over your original:

  • To get gcc/clang to fuse the mul/adds into FMA, I used -ffp-contract=fast. This gets FMA instructions for high throughput without using -ffast-math. (There is a lot of parallelism with the three separate accumulators, so the increased latency of FMA compared to addps shouldn't hurt at all. I expect port0/1 throughput is the limiting factor here.) I thought gcc did this automatically, but it seems it doesn't here without -ffast-math.

  • Notice that v^(3/2) = sqrt(v)^3 = sqrt(v)*v. This has lower latency and fewer instructions.

  • Interchanged the loops, and used broadcast-loads in the inner loop to improve locality (cutting the bandwidth requirement by a factor of 4, or 8 with AVX). Each iteration of the inner loop only reads 4B of new data from each of the four source arrays (x, y, z, and m), so it makes a lot of use of each cache line while it's hot.

    Using broadcast-loads in the inner loop also means we accumulate ax[i + 0..3] in parallel, avoiding the need for a horizontal sum, which takes extra code. (See a previous version of this answer for code that interchanged the loops but still used vector loads in the inner loop, with a stride of 16B.)

It compiles nicely for Haswell with gcc, using FMA. (Still only 128b vector size though, unlike clang's auto-vectorized 256b version). The inner loop is only 20 instructions, and only 13 of those are FPU ALU instructions that need port 0/1 on Intel SnB-family. It makes decent code even with baseline SSE2: no FMA, and needs shufps for the broadcast-loads, but those don't compete for execution units with add/mul.

#include <immintrin.h>

void ffunc_sse128(float *restrict ax, float *restrict ay, float *restrict az,
           const float *x, const float *y, const float *z,
           int N, float eps, const float *restrict m)
{
  for (int i = 0; i < N; i+=4) {
    __m128 xi_v = _mm_load_ps(&x[i]);
    __m128 yi_v = _mm_load_ps(&y[i]);
    __m128 zi_v = _mm_load_ps(&z[i]);

    // vector accumulators for ax[i + 0..3] etc.
    __m128 axi_v = _mm_setzero_ps();
    __m128 ayi_v = _mm_setzero_ps();
    __m128 azi_v = _mm_setzero_ps();
    
    // AVX broadcast-loads are as cheap as normal loads,
    // and data-reuse meant that stand-alone load instructions were used anyway.
    // so we're not even losing out on folding loads into other insns
    // An inner-loop stride of only 4B is a huge win if memory / cache bandwidth is a bottleneck
    // even without AVX, the shufps instructions are cheap,
    // and don't compete with add/mul for execution units on Intel
    
    for (int j = 0; j < N; j++) {
      __m128 xj_v = _mm_set1_ps(x[j]);
      __m128 rx_v = _mm_sub_ps(xj_v, xi_v);

      __m128 yj_v = _mm_set1_ps(y[j]);
      __m128 ry_v = _mm_sub_ps(yj_v, yi_v);

      __m128 zj_v = _mm_set1_ps(z[j]);
      __m128 rz_v = _mm_sub_ps(zj_v, zi_v);

      __m128 mj_v = _mm_set1_ps(m[j]);   // do the load early

      // sum of squared differences
      __m128 r2_v = _mm_set1_ps(eps) + rx_v*rx_v + ry_v*ry_v + rz_v*rz_v;   // GNU extension
      /* __m128 r2_v = _mm_add_ps(_mm_set1_ps(eps), _mm_mul_ps(rx_v, rx_v));
      r2_v = _mm_add_ps(r2_v, _mm_mul_ps(ry_v, ry_v));
      r2_v = _mm_add_ps(r2_v, _mm_mul_ps(rz_v, rz_v));
      */

      // rsqrt and a Newton-Raphson iteration might have lower latency
      // but there are enough surrounding instructions and enough cross-iteration parallelism
      // that the single-uop sqrtps and divps instructions probably aren't a bottleneck
      __m128 r2sqrt = _mm_sqrt_ps(r2_v);
      __m128 r6sqrt = _mm_mul_ps(r2_v, r2sqrt);  // v^(3/2) = sqrt(v)^3 = sqrt(v)*v
      __m128 s_v = _mm_div_ps(mj_v, r6sqrt);

      __m128 srx_v = _mm_mul_ps(s_v, rx_v);
      __m128 sry_v = _mm_mul_ps(s_v, ry_v);
      __m128 srz_v = _mm_mul_ps(s_v, rz_v);

      axi_v = _mm_add_ps(axi_v, srx_v);
      ayi_v = _mm_add_ps(ayi_v, sry_v);
      azi_v = _mm_add_ps(azi_v, srz_v);
    }
    _mm_store_ps(&ax[i], axi_v);
    _mm_store_ps(&ay[i], ayi_v);
    _mm_store_ps(&az[i], azi_v);
  }
}
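
A usage sketch (my assumptions, not part of the code above): ax/ay/az and x/y/z must be 16-byte aligned because of the _mm_load_ps / _mm_store_ps, m is only read with _mm_set1_ps so it doesn't need alignment, and N should be a multiple of 4:

float *x = _mm_malloc(N * sizeof(float), 16);   // likewise y, z, ax, ay, az
float *m = malloc(N * sizeof(float));           // scalar-only reads, no alignment requirement
// ... fill x, y, z, m; ax/ay/az are overwritten with the accumulated results ...
ffunc_sse128(ax, ay, az, x, y, z, N, eps, m);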

I also tried a version with rcpps, but IDK if it will be faster. Note that with -ffast-math, gcc and clang will convert the division into rcpps + a Newton iteration. (For some reason they don't convert 1.0f/sqrtf(x) into rsqrt + Newton, even in a stand-alone function.) clang does a better job, using FMA for the iteration step.

#define USE_RSQRT
#ifndef USE_RSQRT
      // even with -mrecip=vec-sqrt after -ffast-math, this still does sqrt(v)*v, then rcpps
      __m128 r2sqrt = _mm_sqrt_ps(r2_v);
      __m128 r6sqrt = _mm_mul_ps(r2_v, r2sqrt);  // v^(3/2) = sqrt(v)^3 = sqrt(v)*v
      __m128 s_v = _mm_div_ps(mj_v, r6sqrt);
#else
      __m128 r2isqrt = rsqrt_float4_single(r2_v);
      // can't use the sqrt(v)*v trick, unless we either do normal sqrt first then rcpps
      // or rsqrtps and rcpps.  Maybe it's possible to do a Newton-Raphson iteration on that product
      // instead of refining them both separately?
      __m128 r6isqrt = r2isqrt * r2isqrt * r2isqrt;
      __m128 s_v = _mm_mul_ps(mj_v, r6isqrt);
#endif
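
rsqrt_float4_single isn't shown here; what I have in mind is rsqrtps plus one Newton-Raphson step, along these lines (treat the exact implementation as a sketch, not tested code):

// ~12-bit rsqrtps approximation refined with one Newton-Raphson step:
// y_new = y * (1.5f - 0.5f * x * y * y)
static inline __m128 rsqrt_float4_single(__m128 x) {
    __m128 y     = _mm_rsqrt_ps(x);
    __m128 halfx = _mm_mul_ps(_mm_set1_ps(0.5f), x);
    __m128 yy    = _mm_mul_ps(y, y);
    return _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(halfx, yy)));
}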
  • This looks very, very interesting, and I will test it out shortly. However, I do have a question. What exactly does -ffp-contract=fast do? Is it a relaxed math option? – Kieran Lavelle Mar 02 '16 at 09:05
  • @KieranLavelle: It allows add/sub and mul to contract into FMA, without the intervening rounding step. Nothing more, nothing less. It doesn't enable any other `-ffast-math` stuff that can change results in other ways. Later in that paragraph, I linked to a Q&A about contracting to FMA, so see that for more. – Peter Cordes Mar 02 '16 at 09:13
  • I took your advice at the very top of your post ("This looks wrong") as a first step. After doing the tweaking you suggested, plus the fix from the comment about _mm_add_ps, I was able to get it working; however, my final answer is off by around 4.2% when I don't use the rsqrt function, and about 6% when I do. Where can this inaccuracy come from, especially when I don't use the rsqrt function? – Kieran Lavelle Mar 02 '16 at 10:51
  • @KieranLavelle: 4.2% compared to what? Compared to the scalar code? No `-ffast-math` in either case? 64bit code in both cases, so it's not using x87 with higher internal precision? Subtracting two nearby numbers is a big problem for floating point, known as catastrophic cancellation. The scalar code should have exactly the same problem, though. – Peter Cordes Mar 02 '16 at 11:07
  • It *might* help a bit to up-convert to double-precision for the sum of squared differences step. You can probably convert back down to `float` before adding `eps`. (`eps` is to avoid division by zero for `i == j` and other cases, right?) I'm not a numerics guy, so IDK if `1.0/sqrt` and then multiplying has any implications for precision vs. doing `m[j] / sqrt`. All arithmetic (other than `rsqrt` / `rcp`) is correctly rounded, so rounding error at each step is at worst +/- 0.5ulp. You should really look at my final version: I'm pretty convinced it's a big win for cache reasons. – Peter Cordes Mar 02 '16 at 11:15
  • This looks fantastic. Thanks for the advice; it's not too big an issue, as the answer I am generating is an estimate anyway, but the information you have posted is very useful. – Kieran Lavelle Mar 02 '16 at 11:25
  • @KieranLavelle: Have a look at the `#ifdef`ed version with/without `rsqrt`, and see which does better (without `-ffast-math`). I'd love to hear any practical performance feedback, since I often make perf suggestions in SO questions but don't get feedback to make sure I'm giving good advice. :P Since I often don't take the time to put a timing framework around it and test it on my own SnB CPU. – Peter Cordes Mar 02 '16 at 11:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/105174/discussion-between-kieran-lavelle-and-peter-cordes). – Kieran Lavelle Mar 02 '16 at 17:34