To introduce myself to x86 intrinsics (and cache friendliness, to a lesser extent) I explicitly vectorized a bit of code I use for RBF (radial basis function) based grid deformation. Having found vsqrtpd to be the major bottleneck, I want to know if/how I can mask its latency further. This is the scalar computational kernel:

for(size_t i=0; i<nPt; ++i)
{
    double xi = X[i], yi = X[i+nPt], zi = X[i+2*nPt];

    for(size_t j=0; j<nCP; ++j)
    {
        // compute distance from i to j
        double d = sqrt(pow(xi-Xcp[   j   ],2)+
                        pow(yi-Xcp[ j+nCP ],2)+
                        pow(zi-Xcp[j+2*nCP],2));

        // compute the RBF kernel coefficient
        double t = max(0.0,1.0-d);
        t = pow(t*t,2)*(1.0+4.0*d);

        // update coordinates
        for(size_t k=0; k<nDim; ++k) X[i+k*nPt] += t*Ucp[j+k*nCP];
    }
}

nPt is the number of target points and it is much larger than nCP, the number of source points/displacements. The latter fit in the L3 cache, so the innermost loop is always over source points.

  • The first optimization step was to work on 4 target points simultaneously; source point data was still accessed via scalar loads followed by broadcasts (a simplified sketch of this version follows the list).
  • The second step was to target L1 by blocking the loops. Blocking the i-loop was somehow much more important than blocking the j-loop, which gave only a marginal improvement. The innermost loop is still over j to reduce loads/stores.
  • The third was to load 4 control points and use shuffle/permute to go over the 4 combinations of i-j instead of using broadcasts.
  • The fourth, after observing that omitting the square root gives a 1.5x speed-up (to about 70% of the FP performance of a large LLT on an i7-7700), was to dedicate 4 registers to the computation of the 4 square roots to (maybe?) allow some other computation to take place... 1% improvement vs the third step.
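
For reference, this is roughly what that first broadcast version looked like. It is a simplified sketch rather than the exact code: `deform_v1` is just a placeholder name, it assumes nPt is a multiple of 4 and 32-byte-aligned data, and it has none of the tiling of the later steps.

#include <immintrin.h>
#include <cstddef>

// Sketch of the first step: 4 target points per iteration,
// each source point fetched with a scalar load + broadcast.
void deform_v1(size_t nPt, size_t nCP, const double* Xcp, const double* Ucp, double* X)
{
    const __m256d vone  = _mm256_set1_pd(1.0),
                  vfour = _mm256_set1_pd(4.0);

    for(size_t i=0; i<nPt; i+=4)
    {
        __m256d xi = _mm256_load_pd(&X[   i   ]),
                yi = _mm256_load_pd(&X[ i+nPt ]),
                zi = _mm256_load_pd(&X[i+2*nPt]),
                ui = _mm256_setzero_pd(),
                vi = _mm256_setzero_pd(),
                wi = _mm256_setzero_pd();

        for(size_t j=0; j<nCP; ++j)
        {
            // one source point, broadcast to all 4 lanes
            __m256d t, d,
            xj = _mm256_set1_pd(Xcp[   j   ]),
            yj = _mm256_set1_pd(Xcp[ j+nCP ]),
            zj = _mm256_set1_pd(Xcp[j+2*nCP]);

            // distance from the 4 points i to source point j
            d = _mm256_sub_pd(xi,xj);  d = _mm256_mul_pd(d,d);
            t = _mm256_sub_pd(yi,yj);  d = _mm256_fmadd_pd(t,t,d);
            t = _mm256_sub_pd(zi,zj);  d = _mm256_fmadd_pd(t,t,d);
            d = _mm256_sqrt_pd(d);

            // RBF coefficient t = max(0,1-d)^4 * (1+4d), same algebra as the scalar kernel
            t = _mm256_sub_pd(vone, _mm256_min_pd(vone, d));
            t = _mm256_mul_pd(t,t);  t = _mm256_mul_pd(t,t);
            t = _mm256_mul_pd(t, _mm256_fmadd_pd(vfour, d, vone));

            // accumulate the displacements of the 4 points i
            ui = _mm256_fmadd_pd(t, _mm256_set1_pd(Ucp[   j   ]), ui);
            vi = _mm256_fmadd_pd(t, _mm256_set1_pd(Ucp[ j+nCP ]), vi);
            wi = _mm256_fmadd_pd(t, _mm256_set1_pd(Ucp[j+2*nCP]), wi);
        }

        _mm256_store_pd(&X[   i   ], _mm256_add_pd(xi, ui));
        _mm256_store_pd(&X[ i+nPt ], _mm256_add_pd(yi, vi));
        _mm256_store_pd(&X[i+2*nPt], _mm256_add_pd(zi, wi));
    }
}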

Current code

#include <immintrin.h>
#include <algorithm>   // std::min
#include <cstddef>     // size_t

using std::min;

void deform(size_t nPt, size_t nCP, const double* Xcp, const double* Ucp, double* X)
{
    const size_t SIMDLEN = 4;

    // tile ("cache block") sizes
    const size_t TILEH = 512;
    const size_t TILEW = 256;

    // fill two registers with the constants we need
    __m256d vone  = _mm256_set1_pd(1.0),
            vfour = _mm256_set1_pd(4.0);

    // explicitly vectorized (multiple i's at a time) and blocked
    // outer most loop over sets of #TILEH points
    for(size_t i0=0; i0<nPt; i0+=TILEH)
    {
        // displacement buffer, due to tiling, coordinates cannot be modified in-place
        alignas(64) double U[3*TILEH];

        // zero the tile displacements
        for(size_t k=0; k<3*TILEH; k+=SIMDLEN)
            _mm256_store_pd(&U[k], _mm256_setzero_pd());

        // stop point for inner i loop
        size_t iend = min(i0+TILEH,nPt);

        // second loop over sets of #TILEW control points
        for(size_t j0=0; j0<nCP; j0+=TILEW)
        {
            // stop point for inner j loop
            size_t jend = min(j0+TILEW,nCP);

            // inner i loop, over #TILEH points
            // vectorized, operate on #SIMDLEN points at a time
            for(size_t i=i0; i<iend; i+=SIMDLEN)
            {
                // coordinates and displacements of points i
                __m256d
                xi = _mm256_load_pd(&X[   i   ]),
                yi = _mm256_load_pd(&X[ i+nPt ]),
                zi = _mm256_load_pd(&X[i+2*nPt]),
                ui = _mm256_load_pd(&U[    i-i0    ]),
                vi = _mm256_load_pd(&U[ i-i0+TILEH ]),
                wi = _mm256_load_pd(&U[i-i0+2*TILEH]);

                // inner j loop, over #TILEW control points, vectorized loads
                for(size_t j=j0; j<jend; j+=SIMDLEN)
                {
                    // coordinates of points j, and an aux var
                    __m256d t,
                    xj = _mm256_load_pd(&Xcp[   j   ]),
                    yj = _mm256_load_pd(&Xcp[ j+nCP ]),
                    zj = _mm256_load_pd(&Xcp[j+2*nCP]);

                    // compute the possible 4 distances from i to j...
                    #define COMPUTE_DIST(D) __m256d                         \
                    D = _mm256_sub_pd(xi,xj);  D = _mm256_mul_pd(D,D);      \
                    t = _mm256_sub_pd(yi,yj);  D = _mm256_fmadd_pd(t,t,D);  \
                    t = _mm256_sub_pd(zi,zj);  D = _mm256_fmadd_pd(t,t,D);  \
                    D = _mm256_sqrt_pd(D)

                    // ...by going through the different permutations
                    #define SHUFFLE(FUN,IMM8)   \
                    xj = FUN(xj,xj,IMM8);       \
                    yj = FUN(yj,yj,IMM8);       \
                    zj = FUN(zj,zj,IMM8)

                    COMPUTE_DIST(d0);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    COMPUTE_DIST(d1);

                    SHUFFLE(_mm256_permute2f128_pd,1);
                    COMPUTE_DIST(d2);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    COMPUTE_DIST(d3);

                    // coordinate registers now hold the displacements
                    xj = _mm256_load_pd(&Ucp[   j   ]);
                    yj = _mm256_load_pd(&Ucp[ j+nCP ]);
                    zj = _mm256_load_pd(&Ucp[j+2*nCP]);

                    // coefficients for each set of distances...
                    #define COMPUTE_COEFF(C)                                \
                    t = _mm256_min_pd(vone,C);  t = _mm256_sub_pd(vone,t);  \
                    t = _mm256_mul_pd(t,t);     t = _mm256_mul_pd(t,t);     \
                    C = _mm256_fmadd_pd(vfour,C,vone);                      \
                    C = _mm256_mul_pd(t,C)

                    // ...+ update i point displacements
                    #define UPDATE_DISP(C)          \
                    COMPUTE_COEFF(C);               \
                    ui = _mm256_fmadd_pd(C,xj,ui);  \
                    vi = _mm256_fmadd_pd(C,yj,vi);  \
                    wi = _mm256_fmadd_pd(C,zj,wi)

                    UPDATE_DISP(d0);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    UPDATE_DISP(d1);

                    SHUFFLE(_mm256_permute2f128_pd,1);
                    UPDATE_DISP(d2);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    UPDATE_DISP(d3);
                }

                // store updated displacements
                _mm256_store_pd(&U[    i-i0    ], ui);
                _mm256_store_pd(&U[ i-i0+TILEH ], vi);
                _mm256_store_pd(&U[i-i0+2*TILEH], wi);
            }
        }

        // add tile displacements to the coordinates
        for(size_t k=0; k<3; ++k)
        {
            for(size_t i=i0; i<iend; i+=SIMDLEN)
            {
                __m256d
                x = _mm256_load_pd(&X[i+k*nPt]),
                u = _mm256_load_pd(&U[i-i0+k*TILEH]);
                x = _mm256_add_pd(x,u);
                _mm256_stream_pd(&X[i+k*nPt], x);
            }
        }
    }
}

So what more can I do to it? Or, am I doing something very wrong?

Thank you, P. Gomes

  • Have you checked perf counters for `arith.divider_active` ~= core clock cycles? If so, you're saturating the (not fully pipelined) divider throughput and there's not much left to gain, unless you can avoid some sqrts. – Peter Cordes Aug 27 '19 at 15:24
  • Do you absolutely need the accuracy of doubles? If floats are good enough then there's a great discussion [here](https://stackoverflow.com/questions/31555260/fast-vectorized-rsqrt-and-reciprocal-with-sse-avx-depending-on-precision) – Mike Vine Aug 27 '19 at 15:39
  • @MikeVine: With this much work between sqrt operations, using hardware single-precision sqrt is probably best. Bare `x * rsqrtps(x)` without Newton-Raphson is probably too inaccurate (and breaks on x==0), but an NR iteration takes too many extra FMA uops to be worth it. Skylake has amazingly good single-precision `vsqrtps` throughput: one per 6 cycles (vs. `vsqrtpd` at one per 9 to 12 cycles). Last time I tuned a function including a sqrt of a polynomial approximation for Skylake, fast-approx reciprocals weren't worth it. But yes, moving to `float` gives you even more than 2x speedup here – Peter Cordes Aug 28 '19 at 08:39
  • Instead of doing a Newton-Raphson refinement and passing that to the `COMPUTE_COEFF` function, you could also consider finding a min-max polynomial which depends on `d_squared` and `rsqrt(max(min(d_squared, 1.0),min_normal))`, i.e., combining the refinement with the polynomial evaluation. Also, if you had lots of point-pairs where `d>=1`, you could consider skipping the calculation for those (probably only makes sense, if they come in blocks large enough to be noticed by the branch-prediction). – chtz Aug 28 '19 at 14:17
  • Minor note: For slightly better accuracy, instead of loading `ui`, `vi`, `wi` at the start of the inner loop, I would set them to zero and add the previous value at the end of the loop (should not make any performance difference). – chtz Aug 28 '19 at 14:20
  • Also not your actual question: You could get away without any shuffling, by loading and storing `xi`,`yi`,`zi`/`ui`,`vi`,`wi` with increments of 1, instead of 4 (this requires some extra care for handling the very first and last elements). A middle-way would be to load them with increments of 2 and just do one in-lane shuffle of the `*j` parameters (or in fact shuffle the `*i` parameters, which could be done outside the loop, if you have enough registers). – chtz Aug 28 '19 at 14:32
  • @PeterCordes, thank you! I was so caught up in micro-optimizing I forgot rule number 1... measure. 98% of the function runtime can be explained by taking the number of square roots and the operation throughput. – P Gomes Aug 29 '19 at 10:06
  • @chtz thank you for the note on accuracy, that might become important if I switch to single precision. – P Gomes Aug 29 '19 at 10:13
  • @chtz I tried the middle-way strategy you suggested and performance went down a bit, possibly due to the unaligned loads; port 5 pressure does not seem to be an issue here. – P Gomes Aug 29 '19 at 10:31

1 Answer

First, check perf counters for arith.divider_active being ~= core clock cycles.

"98% of the function runtime can be explained by taking the number of square roots and the operation throughput." (your measurement)

Or that works too.

If that's the case, you're saturating the (not fully pipelined) divider throughput and there's not much left to gain from just exposing more ILP.


Algorithmic changes are your only real chance to gain anything, e.g. avoiding some sqrt operations or using single precision.

Single-precision gives you 2x as much work per vector for free. But for sqrt-heavy workloads there's an additional gain: vsqrtps throughput per vector is usually better than vsqrtpd. That's the case on Skylake: one per 6 cycles vs. vsqrtpd at one per 9 to 12 cycles. That could move the bottleneck away from the sqrt/divide unit, perhaps to the front-end or the FMA unit.
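
To illustrate, the distance part of your inner loop maps directly to single precision; this is an untested sketch only, with a hypothetical helper name `dist8` and `Xf`/`Xcpf` standing for float copies of your arrays in the same layout. (The shuffle pattern for walking the i-j combinations also has to change with 8 lanes per vector, not shown here.)

#include <immintrin.h>
#include <cstddef>

// Sketch: one distance computation, 8 floats per __m256 instead of 4 doubles.
static inline __m256 dist8(const float* Xf, const float* Xcpf,
                           size_t i, size_t j, size_t nPt, size_t nCP)
{
    __m256 t, d,
    xi = _mm256_load_ps(&Xf[   i   ]),
    yi = _mm256_load_ps(&Xf[ i+nPt ]),
    zi = _mm256_load_ps(&Xf[i+2*nPt]),
    xj = _mm256_load_ps(&Xcpf[   j   ]),
    yj = _mm256_load_ps(&Xcpf[ j+nCP ]),
    zj = _mm256_load_ps(&Xcpf[j+2*nCP]);

    d = _mm256_sub_ps(xi,xj);  d = _mm256_mul_ps(d,d);
    t = _mm256_sub_ps(yi,yj);  d = _mm256_fmadd_ps(t,t,d);
    t = _mm256_sub_ps(zi,zj);  d = _mm256_fmadd_ps(t,t,d);
    return _mm256_sqrt_ps(d);  // vsqrtps: 8 results, better throughput than vsqrtpd
}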

vrsqrtps has been suggested in comments. That would be worth considering (if single-precision is an option), but it's not a clear win when you need a Newton-Raphson iteration to get enough precision. Bare x * rsqrtps(x) without Newton-Raphson is probably too inaccurate (and needs a cmp/AND to work around x==0.0), but an NR iteration can take too many extra FMA uops to be worth it.
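
If you do want to experiment with it anyway, the pattern looks roughly like this (untested sketch, hypothetical helper name):

#include <immintrin.h>

// Sketch: single-precision sqrt via rsqrtps + one Newton-Raphson step,
// with a fixup for inputs that are exactly 0.0f (rsqrt(0) = inf, 0*inf = NaN).
static inline __m256 sqrt_nr_ps(__m256 x)
{
    const __m256 half  = _mm256_set1_ps(0.5f),
                 onep5 = _mm256_set1_ps(1.5f);

    __m256 y = _mm256_rsqrt_ps(x);                  // ~12-bit approximation of 1/sqrt(x)
    __m256 h = _mm256_mul_ps(half, x);              // 0.5 * x
    // one NR step: y = y * (1.5 - 0.5*x*y*y)
    y = _mm256_mul_ps(y, _mm256_fnmadd_ps(h, _mm256_mul_ps(y,y), onep5));

    __m256 s = _mm256_mul_ps(x, y);                 // ~sqrt(x)
    __m256 iszero = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_EQ_OQ);
    return _mm256_andnot_ps(iszero, s);             // zero lanes where x == 0.0f
}

That trades one vsqrtps for an rsqrt, four multiplies, an FNMA, a compare and an ANDN per vector, which is why it only pays off when the sqrt/divide unit, rather than the FMA units or the front-end, is the bottleneck.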

(AVX512 with vrsqrt14ps/pd has more precision in the approximation, but usually still not enough to use without Newton. But interestingly it does exist for double-precision. Of course if you're on Xeon Phi, sqrt is very slow and you're intended to use AVX512ER vrsqrt28pd + Newton, or just vrsqrt28ps on its own.)
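
For reference, a double-precision version along those lines might look something like this (untested sketch with a hypothetical helper name; two Newton steps since 14 -> 28 -> ~50+ bits, and still not correctly rounded):

#include <immintrin.h>

// Sketch (AVX-512F): double-precision sqrt via vrsqrt14pd + two Newton-Raphson steps.
static inline __m512d sqrt_rsqrt14_pd(__m512d x)
{
    const __m512d half  = _mm512_set1_pd(0.5),
                  onep5 = _mm512_set1_pd(1.5);

    __m512d h = _mm512_mul_pd(half, x);             // 0.5 * x
    __m512d y = _mm512_rsqrt14_pd(x);               // ~14-bit approximation of 1/sqrt(x)
    y = _mm512_mul_pd(y, _mm512_fnmadd_pd(h, _mm512_mul_pd(y,y), onep5));  // ~28 bits
    y = _mm512_mul_pd(y, _mm512_fnmadd_pd(h, _mm512_mul_pd(y,y), onep5));  // ~50+ bits

    __m512d s = _mm512_mul_pd(x, y);                // ~sqrt(x)
    // rsqrt14(0) = +inf, so 0*inf = NaN: zero those lanes with a zero-masked move
    __mmask8 nonzero = _mm512_cmp_pd_mask(x, _mm512_setzero_pd(), _CMP_NEQ_OQ);
    return _mm512_maskz_mov_pd(nonzero, s);
}

Whether something like that beats hardware vsqrtpd depends on what else is competing for the FMA ports in the surrounding code.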

Last time I tuned a function including a sqrt of a polynomial-approximation for Skylake, fast-approx reciprocals weren't worth it. Hardware single-precision sqrt was the best choice that gave us the required precision (and we weren't even considering needing double). There was more work than yours between sqrt operations, though.

Peter Cordes