
The code below does not seem to be the bottleneck.

I am just curious to know if there is a faster way to do this on a CPU with SSE4.2.

The code works on the lower triangular entries of a matrix stored as a 1d array in the following form in ar_tri:

[ (1,0),
  (2,0),(2,1),
  (3,0),(3,1),(3,2), 
  ...,
  (n,0)...(n,n-1) ]

where (x,y) is the entry of the matrix at row x and column y.
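
For reference, entry (x,y) with y < x maps to position x*(x-1)/2 + y in ar_tri (0-based, diagonal excluded). A tiny helper, not part of the code being benchmarked, just to make the layout concrete:

size_t tri_index(size_t x, size_t y){
    // rows 1..x-1 contribute 1+2+...+(x-1) = x*(x-1)/2 entries before row x
    return x*(x-1)/2 + y;   // e.g. (1,0) -> 0, (2,1) -> 2, (3,0) -> 3
}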

It also uses the reciprocal square roots (rsqrt) of the diagonal entries of the matrix, stored in ar_rdia in the following form:

[ rsqrt(0,0), rsqrt(1,1), ..., rsqrt(n-1,n-1) ]

gcc 6.1 with -O3 on the Godbolt compiler explorer auto-vectorizes both versions using SIMD instructions (mulps). The triangular version has cleanup code at the end of each row, so there are some scalar instructions, too.

Would using a rectangular matrix stored as a 1D array in contiguous memory improve performance?

// Triangular version
#include <iostream>
#include <stdlib.h>
#include <stdint.h>
using namespace std;
int main(void){
    size_t n = 10000;
    size_t n_tri = n*(n-1)/2;

    size_t repeat = 10000;
        // test 10000 cycles of the code
    float* ar_rdia = (float*)aligned_alloc(16, n*sizeof(float));
        //reciprocal square root of diagonal
    float* ar_triangular = (float*)aligned_alloc(16, n_tri*sizeof(float));
        //lower triangular matrix

    size_t i,j,k;
    float a,b;
    k = 0;
    for(i = 0; i < n; ++i){
        for(j = 0; j < i; ++j){
           ar_triangular[k] *= ar_rdia[i]*ar_rdia[j];
           ++k;
        }
    }

    cout << k;
    free((void*)ar_rdia);
    free((void*)ar_triangular);
}

// Square version
#include <iostream>
#include <stdlib.h>
#include <stdint.h>
using namespace std;
int main(void){
    size_t n = 10000;
    size_t n_sq = n*n;

    size_t repeat = 10000;
        // test 10000 cycles of the code
    float* ar_rdia = (float*)aligned_alloc(16, n*sizeof(float));
        //reciprocal square root of diagonal
    float* ar_square = (float*)aligned_alloc(16, n_sq*sizeof(float));
        // square matrix

    size_t i,j,k;
    float a,b;
    k = 0;
    for(i = 0; i < n; ++i){
        for(j = 0; j < n; ++j){
           ar_square[k] *= ar_rdia[i]*ar_rdia[j];
           ++k;
        }
    }
    cout << k;
    free((void*)ar_rdia);
    free((void*)ar_square);
}

assembly output:

## Triangular version
main:
        ...
        call    aligned_alloc
        movl    $1, %edi
        movq    %rax, %rbp
        xorl    %esi, %esi
        xorl    %eax, %eax
.L2:
        testq   %rax, %rax
        je      .L3
        leaq    -4(%rax), %rcx
        leaq    -1(%rax), %r8
        movss   (%rbx,%rax,4), %xmm0
        shrq    $2, %rcx
        addq    $1, %rcx
        cmpq    $2, %r8
        leaq    0(,%rcx,4), %rdx
        jbe     .L9
        movaps  %xmm0, %xmm2
        leaq    0(%rbp,%rsi,4), %r10
        xorl    %r8d, %r8d
        xorl    %r9d, %r9d
        shufps  $0, %xmm2, %xmm2       # broadcast ar_rdia[i]
.L6:                                # vectorized loop
        movaps  (%rbx,%r8), %xmm1
        addq    $1, %r9
        mulps   %xmm2, %xmm1
        movups  (%r10,%r8), %xmm3
        mulps   %xmm3, %xmm1
        movups  %xmm1, (%r10,%r8)
        addq    $16, %r8
        cmpq    %rcx, %r9
        jb      .L6
        cmpq    %rax, %rdx
        leaq    (%rsi,%rdx), %rcx
        je      .L7
.L4:                                      # scalar cleanup
        movss   (%rbx,%rdx,4), %xmm1
        leaq    0(%rbp,%rcx,4), %r8
        leaq    1(%rdx), %r9
        mulss   %xmm0, %xmm1
        cmpq    %rax, %r9
        mulss   (%r8), %xmm1
        movss   %xmm1, (%r8)
        leaq    1(%rcx), %r8
        jnb     .L7
        movss   (%rbx,%r9,4), %xmm1
        leaq    0(%rbp,%r8,4), %r8
        mulss   %xmm0, %xmm1
        addq    $2, %rdx
        addq    $2, %rcx
        cmpq    %rax, %rdx
        mulss   (%r8), %xmm1
        movss   %xmm1, (%r8)
        jnb     .L7
        mulss   (%rbx,%rdx,4), %xmm0
        leaq    0(%rbp,%rcx,4), %rcx
        mulss   (%rcx), %xmm0
        movss   %xmm0, (%rcx)
.L7:
        addq    %rax, %rsi
        cmpq    $10000, %rdi
        je      .L16
.L3:
        addq    $1, %rax
        addq    $1, %rdi
        jmp     .L2
.L9:
        movq    %rsi, %rcx
        xorl    %edx, %edx
        jmp     .L4
.L16:
        ... print and free
        ret

The interesting part of the assembly for the square case:

main:
        ... allocate both arrays
        call    aligned_alloc
        leaq    40000(%rbx), %rsi
        movq    %rax, %rbp
        movq    %rbx, %rcx
        movq    %rax, %rdx
.L3:                                       # loop over i
        movss   (%rcx), %xmm2
        xorl    %eax, %eax
        shufps  $0, %xmm2, %xmm2           # broadcast ar_rdia[i]
.L2:                                       # vectorized loop over j
        movaps  (%rbx,%rax), %xmm0
        mulps   %xmm2, %xmm0
        movups  (%rdx,%rax), %xmm1
        mulps   %xmm1, %xmm0
        movups  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpq    $40000, %rax
        jne     .L2
        addq    $4, %rcx             # no scalar cleanup: gcc noticed that the row length is a multiple of 4 elements
        addq    $40000, %rdx
        cmpq    %rsi, %rcx
        jne     .L3

        ... print and free
        ret
rxu
  • I bet you could do this entirely in compile time. – David Jul 28 '16 at 19:39
  • Since you have a very specific question, why don't you try it? – Dutow Jul 28 '16 at 19:41
  • I guess the square array is better. Compared to the triangular one, half of the entries are not needed, but SSE can give four times the performance. So, overall, it is still better. – rxu Jul 28 '16 at 19:41
  • Could you post the exact compiler command line you used? For me, the code generated from your example uses `movss` and `mulss`, both SSE instructions. – Dutow Jul 28 '16 at 19:50
  • I thought mulps is better than mulss. The square version can use mulps, the triangular version can't. – rxu Jul 28 '16 at 19:54
  • Your original question said the triangular version didn't use sse instructions - which you corrected since. For the "how is it better with SSE question:" measure, measure, and measure. And add automatic measure tests into your CI server. And measure more. – Dutow Jul 28 '16 at 20:00
  • You can transform a triangular loop into a rectangle using the same number of iterations. See the end of my answer [here](http://stackoverflow.com/a/33836073/2542702). I don't think this will help for SIMD because I think you would still need efficient gather operations but it works well for threading. – Z boson Jul 29 '16 at 09:58
  • @Dutow and rxu: [`mulss` is a Scalar Single-precision multiply](http://www.felixcloutier.com/x86/MULSS.html), using xmm registers. It's NOT a SIMD instruction, it's just the modern (SSE) way to do scalar FP math. SSE2 makes x87 obsolete for `float` and `double`. – Peter Cordes Jul 29 '16 at 13:25
  • @Z boson: The fused loop seems to use a division and % operation for each element. Those two operations makes an overhead that takes more time than the actual calculation. What are the benefits of a fused loop? I haven't thought about the cache, but may be it is possible to pair up the n rows like (0th and (n-1)th row), (1st and (n-2)th row) ... That way uses only addition and subtraction, but also need to write more code for the remainder rows. – rxu Jul 29 '16 at 14:52
  • @Zboson: I don't see a problem for SIMD for this, other than a bit of cleanup code to handle the end of each row. There are multiple options for how to implement that. There's no gathering needed, just broadcast-loads for `ar_rdia[i]`. – Peter Cordes Jul 29 '16 at 14:55

1 Answer


The loop that stores to the triangular array should vectorize ok, with inefficiencies at the end of each row. gcc actually did auto-vectorize both, according to the asm you posted. I wish I'd looked at that first instead of taking your word for it that it needed to be manually vectorized. :(

.L6:   # from the first asm dump.
    movaps  (%rbx,%r8), %xmm1
    addq    $1, %r9
    mulps   %xmm2, %xmm1
    movups  (%r10,%r8), %xmm3
    mulps   %xmm3, %xmm1
    movups  %xmm1, (%r10,%r8)
    addq    $16, %r8
    cmpq    %rcx, %r9
    jb      .L6

This looks exactly like the inner loop that my manually vectorized version would compile to. .L4 is the fully-unrolled scalar cleanup for the last up-to-3 elements of a row (so it's probably not quite as good as my code). Still, it's quite decent, and auto-vectorization will let you take advantage of AVX and AVX-512 with no source changes.

I edited your question to include a link to the code on godbolt, with both versions as separate functions. I didn't take the time to convert them to taking the arrays as function args, because then I'd have to take time to get all the __restrict__ keywords right, and to tell gcc that the arrays are aligned on a 4B * 16 = 64 byte boundary, so it can use aligned loads if it wants to.


Within a row, you're using the same ar_rdia[i] every time, so you broadcast that into a vector once at the start of the row. Then you just do vertical operations between the source ar_rdia[j + 0..3] and destination ar_triangular[k + 0..3].

To handle the last few elements at the end of a row, when the row length isn't a multiple of the vector width, there are a few options:

  • scalar (or narrower vector) fallback / cleanup after the vectorized loop, handling the last up-to-3 elements of each row.
  • unroll the loop over i by 4, and use an optimal sequence for handling the odd 0, 1, 2, or 3 elements left at the end of a row. So the loop over j will be repeated 4 times, with fixed cleanup after each one. This is probably the best approach.
  • have the final vector iteration overshoot the end of a row, instead of stopping after the last full vector. So we overlap the start of the next row. Since your operation is not idempotent, this option doesn't work well. Also, making sure k is updated correctly for the start of the next row takes a bit of extra code.

    Still, this would be possible by having the final vector of a row blend the multiplier so that elements beyond the end of the current row get multiplied by 1.0 (the multiplicative identity). This should be doable with blendvps, using a vector of 1.0 to replace some elements of ar_rdia[i] * ar_rdia[j + 0..3]. We'd also have to create a selector mask (maybe by indexing into an array of int32_t row_overshoot_blend_window {0, 0, 0, 0, -1, -1, -1} using j-i as the index, to take a window of 4 elements); a rough sketch of this idea follows after this list. Another option is branching to select either no blend or one of three immediate blends (blendps is faster, and doesn't require a vector control mask, and the branches will have an easily predictable pattern).

    This causes a store-forwarding failure at the start of 3 of every 4 rows, when the load from ar_triangular overlaps with the store from the end of the last row. IDK which will perform best.

    Another maybe even better option would be to do loads that overshoot the end of the row, and do the math with packed SIMD, but then conditionally store 1 to 4 elements.

    Staying within the memory you allocate can require leaving padding at the end of your buffer, e.g. if the last row isn't a multiple of 4 elements.
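
Here's a rough sketch of that overshoot-and-blend idea for the final vector of a row (just an illustration: the 8-entry window array and the helper name are mine, not necessarily the exact layout described above, and it assumes enough padding after both arrays that the overshooting load/store of the very last row stays in bounds):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// -1 lanes keep the real scale factor; 0 lanes get 1.0 (the multiplicative identity).
// Taking 4 consecutive entries starting at (4 - valid) keeps exactly `valid` lanes.
static const int32_t row_overshoot_blend_window[8] = { -1, -1, -1, -1, 0, 0, 0, 0 };

// Multiply ar_triangular[k..k+3] in place; only the first (i-j) lanes are really
// scaled by ar_rdia[i]*ar_rdia[j..]; lanes that overshoot into the next row get 1.0.
static inline void row_tail_blend(float* ar_triangular, const float* ar_rdia,
                                  size_t i, size_t j, size_t k)
{
    __m128 sf   = _mm_mul_ps(_mm_set1_ps(ar_rdia[i]), _mm_loadu_ps(ar_rdia + j));
    __m128 keep = _mm_castsi128_ps(_mm_loadu_si128(
        (const __m128i*)(row_overshoot_blend_window + (4 - (i - j)))));
    sf = _mm_blendv_ps(_mm_set1_ps(1.0f), sf, keep);   // SSE4.1 blendvps
    __m128 vtri = _mm_loadu_ps(ar_triangular + k);
    _mm_storeu_ps(ar_triangular + k, _mm_mul_ps(vtri, sf));
}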

/****** Normalize a triangular matrix using SIMD multiplies,
   handling the ends of rows with narrower cleanup code *******/

// size_t i,j,k;   // don't do this in C++ or C99.  Put declarations in the narrowest scope possible.  For types without constructors/destructors, it's still a style / human-readability issue

size_t k = 0;
for(size_t i = 0; i < n; ++i){
    // maybe put this inside the for() loop and let the compiler hoist it out, to avoid doing it for small rows where the vector loop doesn't even run once.
    __m128 vrdia_i = _mm_set1_ps(ar_rdia[i]);  // broadcast-load: very efficient with AVX, load+shuffle without.  Only done once per row anyway.
    size_t j = 0;
    for(j = 0; j + 4 <= i; j+=4){  // vectorize over this loop.  (j+4 <= i instead of j < i-3 avoids size_t wraparound on short rows)
       __m128 vrdia_j  = _mm_loadu_ps(ar_rdia + j);
       __m128 scalefac = _mm_mul_ps(vrdia_j, vrdia_i);

       __m128 vtri       = _mm_loadu_ps(ar_triangular + k);
       __m128 normalized = _mm_mul_ps(scalefac , vtri);
       _mm_storeu_ps(ar_triangular + k, normalized);
       k += 4;
    }
    // scalar fallback / cleanup for the ends of rows.  Alternative: blend scalefac with 1.0 so it's ok to overlap into the next row.

 /*  Fine in theory, but gcc likes to make super-bloated code by auto-vectorizing cleanup loops.  Besides, we can do better than scalar
    for ( ; j < i; ++j ){
       ar_triangular[k] *= ar_rdia[i]*ar_rdia[j];     ++k;   }
 */

    if ((i-j) >= 2) {  // load 2 floats (using movsd to zero the upper 64 bits, so mulps doesn't slow down or raise exceptions on denormals or NaNs)
       __m128 vrdia_j  = _mm_castpd_ps( _mm_load_sd(reinterpret_cast<const double*>(ar_rdia + j)) );
       __m128 scalefac = _mm_mul_ps(vrdia_j, vrdia_i);

       __m128 vtri       = _mm_castpd_ps( _mm_load_sd(reinterpret_cast<const double*>(ar_triangular + k)) );
       __m128 normalized = _mm_mul_ps(scalefac , vtri);
       _mm_storel_pi(reinterpret_cast<__m64*>(ar_triangular + k), normalized);      // movlps.  Agner Fog's table indicates that Nehalem decodes this to 2 uops, instead of 1 for movsd.  Bizarre!
       j+=2;
       k+=2;
    }
    if (j<i) {  // last single element
        ar_triangular[k] *= ar_rdia[i]*ar_rdia[j];
        ++k;
        //++j;     // end of the row anyway.  A smart compiler would still optimize it away...
    }
    // another possibility: load 4 elements and do the math, then movss, movsd, movsd + extractps (_mm_extractmem_ps), or movups to store the last 1, 2, 3, or 4 elements of the row.
    // don't use maskmovdqu; it bypasses cache
}

movsd and movlps are equivalent as stores, but not as loads. See this comment thread for discussion of why it makes some sense that the store forms have separate opcodes. Update: Agner Fog's insn tables indicate that Nehalem decodes MOVH/LPS/D to 2 fused-domain uops. They also say that SnB decodes it to 1, but IvB decodes it to 2 uops. That's got to be wrong. For Haswell, his table splits things to separate entries for movlps/d (1 micro-fused uop) and movhps/d (also 1 micro-fused uop). It makes no sense for the store form of movlps to be 2 uops and need the shuffle port on anything; it does exactly the same thing as a movsd store.
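
For reference, the two intrinsic spellings of that 64-bit low-half store look something like this (a small sketch, not code from above; compilers typically emit movlps for the first and movsd for the second, but they store the same 8 bytes):

#include <immintrin.h>

static inline void store_low2_movlps(float* p, __m128 v){
    _mm_storel_pi((__m64*)p, v);                       // usually movlps [p], v
}
static inline void store_low2_movsd(float* p, __m128 v){
    _mm_store_sd((double*)p, _mm_castps_pd(v));        // usually movsd  [p], v
}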

If your matrices are really big, don't worry too much about the end-of-row handling. If they're small, more of the total time is going to be spent on the ends of rows, so it's worth trying multiple ways, and having a careful look at the asm.


You could easily compute rsqrt on the fly here if the source data is contiguous. Otherwise, yeah, copy just the diagonal into an array, and compute rsqrt while doing that copy rather than with another pass over that array like in your previous question. Do it either with scalar rsqrtss and no NR step while copying from the diagonal of the matrix into the array, or by manually gathering elements into a SIMD vector (with _mm_set_ps(a[i][i], a[i+1][i+1], a[i+2][i+2], a[i+3][i+3]) to let the compiler pick the shuffles) and doing rsqrtps + an NR step, then storing the vector of 4 results to the array.
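
A rough sketch of that gather + rsqrtps + Newton-Raphson version, assuming a row-major n*n source matrix a and n a multiple of 4 (the function name and layout are illustrative, not taken from your code):

#include <immintrin.h>
#include <stddef.h>

// Fill ar_rdia[0..n-1] with rsqrt of the diagonal of a row-major n*n matrix.
static void diag_rsqrt(const float* a, float* ar_rdia, size_t n)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    for (size_t i = 0; i < n; i += 4) {
        // gather 4 diagonal elements; let the compiler pick the shuffles
        __m128 d = _mm_set_ps(a[(i+3)*n + (i+3)], a[(i+2)*n + (i+2)],
                              a[(i+1)*n + (i+1)], a[ i   *n +  i   ]);
        __m128 x = _mm_rsqrt_ps(d);                    // ~12-bit estimate
        // one Newton-Raphson step: x = 0.5*x*(3 - d*x*x)
        x = _mm_mul_ps(_mm_mul_ps(half, x),
                       _mm_sub_ps(three, _mm_mul_ps(d, _mm_mul_ps(x, x))));
        _mm_storeu_ps(ar_rdia + i, x);
    }
}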


Small problem sizes: avoiding waste from not doing full vectors at the ends of rows

The very start of the matrix is a special case, because three row "ends" are contiguous in the first 6 elements (the 4th row has 4 elements). It might be worth special-casing this and doing the first 3 rows with two SSE vectors. Or maybe just the first two rows together, and then the third row as a separate group of 3. Actually, a group of 4 and a group of 2 works out much better, because SSE can do those 16B and 8B loads/stores, but not 12B.

The first 6 scale factors are products of the first three elements of ar_rdia, so we can do a single vector load and shuffle it a couple ways.

ar_rdia[0]*ar_rdia[0]
ar_rdia[1]*ar_rdia[0], ar_rdia[1]*ar_rdia[1],
ar_rdia[2]*ar_rdia[0], ar_rdia[2]*ar_rdia[1], ar_rdia[2]*ar_rdia[2]
                     ^
               end of first vector of 4 elems, start of 2nd.

It turns out compilers aren't great at spotting and taking advantage of the patterns here, so to get optimal code for the first 10 elements we need to peel those iterations and optimize the shuffles and multiplies manually. I decided to do the first 4 rows, because the 4th row still reuses that SIMD vector of ar_rdia[0..3]. That vector even gets used again by the first vector-width of row 4 (the fifth row).

Also worth considering: doing 2, 4, 4 instead of this 4, 2, 4.

void triangular_first_4_rows_manual_shuffle(float *tri, const float *ar_rdia)
{
  __m128 vr0 = _mm_load_ps(ar_rdia);      // we know ar_rdia is aligned

  // elements 0-3    // row 0, row 1, and the first element of row 2
  __m128 vi0 = _mm_shuffle_ps(vr0, vr0, _MM_SHUFFLE(2, 1, 1, 0));
  __m128 vj0 = _mm_shuffle_ps(vr0, vr0, _MM_SHUFFLE(0, 1, 0, 0));
  __m128 sf0  = vi0 * vj0;  // equivalent to _mm_mul_ps(vi0, vj0); // gcc defines __m128 in terms of GNU C vector extensions
  __m128 vtri = _mm_load_ps(tri);
  vtri *= sf0;
  _mm_store_ps(tri, vtri);
  tri += 4;

  // elements 4 and 5, last two of third row
  __m128 vi4 = _mm_shuffle_ps(vr0, vr0, _MM_SHUFFLE(3, 3, 2, 2));   // can compile into unpckhps, saving a byte.  Well spotted by clang
  __m128 vj4 = _mm_movehl_ps(vi0, vi0);   // save a mov by reusing a previous shuffle output, instead of a fresh _mm_shuffle_ps(vr0, vr0, _MM_SHUFFLE(2, 1, 2, 1)); // also saves a code byte (no immediate)
  // actually, a movsd from ar_rdia+1 would get these two elements with no shuffle.  We aren't bottlenecked on load-port uops, so that would be good.
  __m128 sf4 = vi4 * vj4;
  //sf4 = _mm_movehl_ps(sf4, sf4);        // doesn't save anything compared to shuffling before multiplying
  // could use movhps to load and store *tri to/from the high half of an xmm reg, but each of those takes a shuffle uop
  // so we shuffle the scale-factor down to the low half of a vector instead.
  __m128 vtri4  = _mm_castpd_ps(_mm_load_sd((const double*)tri));  // elements 4 and 5
  vtri4 *= sf4;
  _mm_storel_pi((__m64*)tri, vtri4);  // 64bit store.  Possibly slower than movsd if Agner's tables are right about movlps, but I doubt it
  tri += 2;

  // elements 6-9 = row 4, still only needing elements 0-3 of ar_rdia
  __m128 vi6 = _mm_shuffle_ps(vr0, vr0, _MM_SHUFFLE(3, 3, 3, 3));  // broadcast.  clang puts this ahead of earlier shuffles.  Maybe we should put this whole block early and load/store this part of tri, too.
  //__m128 vi6 = _mm_movehl_ps(vi4, vi4);
  __m128 vj6 = vr0; // 3, 2, 1, 0 already in the order we want
  __m128 vtri6 = _mm_loadu_ps(tri);    // tri already points at element 6 after the increments above
  vtri6 *= vi6 * vj6;
  _mm_storeu_ps(tri, vtri6);
  tri += 4;
  // ... first 4 rows done
}

gcc and clang compile this very similarly with -O3 -march=nehalem (to enable SSE4.2 but not AVX). See the code on Godbolt, with some other versions that don't compile as nicely:

    # gcc 5.3
    movaps  xmm0, XMMWORD PTR [rsi]     # D.26921, MEM[(__v4sf *)ar_rdia_2(D)]
    movaps  xmm1, xmm0  # tmp108, D.26921
    movaps  xmm2, xmm0  # tmp111, D.26921
    shufps  xmm1, xmm0, 148     # tmp108, D.26921,
    shufps  xmm2, xmm0, 16      # tmp111, D.26921,
    mulps   xmm2, xmm1    # sf0, tmp108
    movhlps xmm1, xmm1        # tmp119, tmp108
    mulps   xmm2, XMMWORD PTR [rdi]       # vtri, MEM[(__v4sf *)tri_5(D)]
    movaps  XMMWORD PTR [rdi], xmm2     # MEM[(__v4sf *)tri_5(D)], vtri
    movaps  xmm2, xmm0  # tmp116, D.26921
    shufps  xmm2, xmm0, 250     # tmp116, D.26921,
    mulps   xmm1, xmm2    # sf4, tmp116
    movsd   xmm2, QWORD PTR [rdi+16]      # D.26922, MEM[(const double *)tri_5(D) + 16B]
    mulps   xmm1, xmm2    # vtri4, D.26922
    movaps  xmm2, xmm0  # tmp126, D.26921
    shufps  xmm2, xmm0, 255     # tmp126, D.26921,
    mulps   xmm0, xmm2    # D.26925, tmp126
    movlps  QWORD PTR [rdi+16], xmm1    #, vtri4
    movups  xmm1, XMMWORD PTR [rdi+24]  # tmp129,
    mulps   xmm0, xmm1    # vtri6, tmp129
    movups  XMMWORD PTR [rdi+24], xmm0  #, vtri6
    ret

Only 22 total instructions for the first 4 rows, and 4 of them are movaps reg-reg moves (clang manages with only 3 reg-reg moves, for a total of 21 instructions). We'd probably save one by getting [ x x 2 1 ] into a vector with a movsd load from ar_rdia+1, instead of yet another movaps + shuffle. That would also reduce pressure on the shuffle port (and ALU uops in general).
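
That tweak would look something like this (a fragment, not part of the function above): one 64-bit movsd load puts [ x x 2 1 ] straight into the low half of a register, replacing a movaps + shufps pair:

// lanes 0,1 = ar_rdia[1], ar_rdia[2]; upper lanes zeroed by the movsd load
__m128 vj4_alt = _mm_castpd_ps(_mm_load_sd((const double*)(ar_rdia + 1)));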

With AVX, clang uses vpermilps for most shuffles, but that just wastes a byte of code-size. Unless it saves power (because it only has 1 input), there's no reason to prefer its immediate form over shufps, unless you can fold a load into it.


I considered using palignr to always go 4-at-a-time through the triangular matrix, but that's almost certainly worse. You'd need those palignrs all the time, not just at the ends.

I think extra complexity / narrower loads/stores at the ends of rows is just going to give out-of-order execution something to do. For large problem sizes, you'll spend most of the time doing 16B at a time in the inner loop. This will probably bottleneck on memory, so less memory-intensive work at the ends of rows is basically free as long as out-of-order execution keeps pulling cache-lines from memory as fast as possible.

So triangular matrices are still good for this use case; keeping your working set dense and in contiguous memory seems good. Depending on what you're going to do next, this might or might not be ideal overall.

Peter Cordes
  • I am sorry for the mistake about vectorization of triangular loop. I haven't read/written programs in assembly before and didn't see the triangular loop was also vectorized. – rxu Jul 29 '16 at 16:18
  • @rxu: Mistakes happen, esp. while you're learning. It was annoying for me, but I'll get over it; The apology helps with that, thanks :). BTW, you could have tested with a benchmark and found that `-O3 -fno-tree-vectorize` was a big slowdown, which would tell you that auto-vectorization did something useful. BTW, do you really only have a Nehalem CPU (guessing based on SSE4.2)? If you're doing any serious number crunching, auto-vectorization with AVX will give you twice the throughput when you're not bottlenecked on memory. – Peter Cordes Jul 29 '16 at 16:24
  • Anyway, optimally handling the end of a row was a somewhat interesting problem to think about, and is still relevant (esp. for small problem sizes) since gcc's autovectorization just does a pure scalar unroll. – Peter Cordes Jul 29 '16 at 16:30
  • @rxu: I just had an even better idea: unroll the loop over i by 4 to trivialize cycling through the 4 different cleanup patterns. – Peter Cordes Jul 29 '16 at 16:37
  • @rxu: Peeling the first 4 rows and manually vectorizing that is interesting. They all use entries from the first 4 elements of `ar_rdia`, so we can just load that once and keep shuffling it. – Peter Cordes Jul 29 '16 at 20:10