
How to calculate the approximate reciprocal square root of an array faster on a CPU with popcnt and SSE4.2?

The input is positive integers (ranging from 0 to about 200,000) stored in an array of floats.

The output is an array of floats.

Both arrays have the correct memory alignment for SSE.

The code below only uses one xmm register, runs on Linux, and can be compiled with gcc -O3 code.cpp -lrt -msse4.2

Thank you.

#include <iostream>
#include <emmintrin.h>
#include <stdlib.h> // aligned_alloc, free
#include <time.h>

using namespace std;
void print_xmm(__m128 xmm){
    float out[4];
    _mm_storeu_ps(out,xmm);
    int i;
    for (i = 0; i < 4; ++i) std::cout << out[i] << " ";
    std::cout << std::endl;
}

void print_arr(float* ptr, size_t size){
    size_t i;
    for(i = 0; i < size; ++i){
        cout << ptr[i] << " ";
    }
    cout << endl;
}

int main(void){
    size_t size = 25000 * 4;
        // this has to be multiple of 4
    size_t repeat = 10000;
        // test 10000 cycles of the code
    float* ar_in = (float*)aligned_alloc(16, size*sizeof(float));
    float* ar_out = (float*)aligned_alloc(16, size*sizeof(float));
    //fill test data into the input array
    //the data is an array of positive numbers.
    size_t i;
    for (i = 0; i < size; ++i){
        ar_in[i] = (i+1) * (i+1);
    }
    //prepare for reciprocal square root.
    __m128 xmm0;
    size_t size_fix = size*sizeof(float)/sizeof(__m128);
    float* ar_in_end  = ar_in + size_fix;
    float* ar_out_now;
    float* ar_in_now;
    //timing
    struct timespec tp_start, tp_end;
    i = repeat;
    clock_gettime(CLOCK_MONOTONIC, &tp_start);
    //start timing
    while(--i){
        ar_out_now = ar_out;
        for(ar_in_now = ar_in;
            ar_in_now != ar_in_end;
            ar_in_now += 4, ar_out_now+=4){
            //4 = sizeof(__m128)/sizeof(float);
            xmm0 = _mm_load_ps(ar_in_now);
            //cout << "load xmm: ";
            //print_xmm(xmm0);
            xmm0 = _mm_rsqrt_ps(xmm0);
            //cout << "rsqrt xmm: ";
            //print_xmm(xmm0);
            _mm_store_ps(ar_out_now,xmm0);
        }
    }
    //end timing
    clock_gettime(CLOCK_MONOTONIC, &tp_end);
    double timing;
    const double nano = 0.000000001;

    timing = ((double)(tp_end.tv_sec  - tp_start.tv_sec )
           + (tp_end.tv_nsec - tp_start.tv_nsec) * nano)/repeat;

    cout << "    timing per cycle: " << timing << endl;
    /* 
    cout << "input array: ";
    print_arr(ar_in, size);
    cout << "output array: ";
    print_arr(ar_out,size);
    */
    //free mem
    free(ar_in);
    free(ar_out);
    return 0;
}
rxu
  • On something like a skylake, `rsqrtps` has 4 cycles latency, but is pipelined so a new `rsqrtps` can be issued every cycle. You could possibly get some speedup by unrolling the loop up to four times, but since the results are stored right away rather than processed further, register renaming and out-of-order execution probably mean that you're not getting much better than what you have now. – EOF Jul 27 '16 at 20:36
  • If I write a parallel version of this code, do I have to think about anything except false sharing? – rxu Jul 27 '16 at 20:45
  • You'll need to make sure to avoid data-races. For example, if you partition the arrays, make sure there is no overlap. While you'd think that writing the same result to the same place twice would be OK, it is not, unless you declare the array as atomic. – EOF Jul 27 '16 at 20:49
  • Your code only appears to require `size` to be a multiple of 4 floats to make the array hold a whole number of vectors of floats. – Peter Cordes Jul 28 '16 at 03:55
  • @EOF: `rsqrt` isn't part of a loop-carried dependency chain, so its 4c latency isn't relevant. Register renaming makes it a non-issue. Unrolling is needed for a different reason: loop overhead and fused-domain uop throughput. – Peter Cordes Jul 28 '16 at 03:57

3 Answers


How big is your array of floats? If it's already hot in L1 (or maybe L2), gcc5.3 output for this code bottlenecks on uop throughput on modern Intel CPUs, since it makes a loop with 6 fused-domain uops that does one vector per iteration. (So it will run at one vector per 2 cycles).

To achieve 1 vector per clock throughput on modern Intel CPUs, you need the loop to be unrolled (see below for why non-unrolled asm can't work). Having the compiler do it for you is probably good (instead of doing it by hand in the C++ source). e.g. use profile-guided optimization (gcc -fprofile-use), or just blindly use -funroll-loops.
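
For example, keeping the question's original command line, the profile-guided route would look roughly like this (a.out is just gcc's default output name):

gcc -O3 -msse4.2 -fprofile-generate code.cpp -lrt
./a.out            # run once on representative data to record the profile
gcc -O3 -msse4.2 -fprofile-use code.cpp -lrt

or simply add -funroll-loops to the original command.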


16 bytes per clock is enough in theory to saturate main memory bandwidth with one core. However, IIRC Z Boson has observed better bandwidth from using multiple cores, probably because multiple cores keep more requests outstanding, and a stall on one core doesn't leave memory idle. If the input is hot in L2 on a core, though, it's probably best to process the data using that core.
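
If you do try multiple threads for a big array, a minimal OpenMP sketch (not from the original answer; compile with -fopenmp, and note each thread gets a contiguous, non-overlapping chunk of whole vectors):

#include <xmmintrin.h>
#include <stddef.h>

// n must be a multiple of 4; in and out must be 16-byte aligned and must not overlap.
void rsqrt_parallel(const float* in, float* out, ptrdiff_t n)
{
    #pragma omp parallel for schedule(static)
    for (ptrdiff_t i = 0; i < n; i += 4) {
        _mm_store_ps(out + i, _mm_rsqrt_ps(_mm_load_ps(in + i)));
    }
}

With schedule(static) each thread gets one contiguous block of iterations, so writes from different threads can only share a cache line right at the chunk boundaries.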

On Haswell or later, 16 bytes loaded and stored per clock is only half of L1 cache bandwidth, so you need an AVX version of this to max per-core bandwidth.
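
A sketch of what an AVX version might look like (my assumption, not code from this answer: it needs 32-byte alignment, e.g. aligned_alloc(32, ...), a size that's a multiple of 8 floats, and -mavx):

#include <immintrin.h>
#include <stddef.h>

// 8 floats (32 bytes) per iteration; n must be a multiple of 8 and the
// pointers 32-byte aligned.
void rsqrt_avx(const float* in, float* out, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        _mm256_store_ps(out + i, _mm256_rsqrt_ps(_mm256_load_ps(in + i)));
    }
}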

If you bottleneck on memory, you might want to do a Newton-Raphson iteration to get a nearly-full accuracy 1/sqrt(x), especially if you use multiple threads for a big array. (Because then it's ok if a single thread can't sustain one load+store per clock.)
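
A common way to write that refinement on top of rsqrtps (a sketch of the standard formula y' = y*(1.5 - 0.5*x*y*y), not code from this answer):

#include <xmmintrin.h>

static inline __m128 rsqrt_nr(__m128 x)
{
    __m128 y     = _mm_rsqrt_ps(x);                    // ~12-bit estimate of 1/sqrt(x)
    __m128 halfx = _mm_mul_ps(_mm_set1_ps(0.5f), x);   // 0.5 * x
    __m128 yy    = _mm_mul_ps(y, y);                   // y^2
    // y' = y * (1.5 - 0.5 * x * y^2), close to full single precision
    return _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(halfx, yy)));
}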

Or maybe just use rsqrt on the fly when loading this data later. It's very cheap, with high throughput, but still latency similar to an FP add. Again, if it's a big array that doesn't fit in cache, increasing computational intensity by doing fewer separate passes over the data is a big deal. (Cache Blocking aka Loop Tiling would also be a good idea, if you can do so: run multiple steps of your algorithm on a cache-sized chunk of your data.)
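
For example (a sketch; do_something_with is a hypothetical stand-in for whatever later pass actually needs 1/sqrt(x)), fold the conversion into that pass instead of writing ar_out at all:

// Hypothetical later pass: compute the reciprocal square root on the fly
// from ar_in instead of re-reading a precomputed ar_out array.
for (size_t i = 0; i < size; i += 4) {
    __m128 r = _mm_rsqrt_ps(_mm_load_ps(ar_in + i));
    do_something_with(r);   // placeholder for the real consumer of 1/sqrt(x)
}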

Only use cache-bypassing NT stores as a last resort if you can't find a way to make effective use of cache. It's much better if you can convert some data that you're about to use, so it's in cache when it's next used.


The main loop (from .L31 to jne .L31 on the Godbolt compiler explorer) is 6 uops for Intel SnB-family CPUs, because indexed addressing modes don't micro-fuse. (This isn't yet documented in Agner Fog's microarch pdf, unfortunately.)

It's 4 fused-domain uops on Nehalem, with only three ALU uops, so Nehalem should run this at 1 per clock.

.L31:    # the main loop: 6 uops on SnB-family, 4 uops on Nehalem
    rsqrtps xmm0, XMMWORD PTR [rbx+rax]       # tmp127, MEM[base: ar_in_now_10, index: ivtmp.51_61, offset: 0B]
    movaps  XMMWORD PTR [rbp+0+rax], xmm0       # MEM[base: ar_out_now_12, index: ivtmp.51_61, offset: 0B], tmp127
    add     rax, 16   # ivtmp.51,
    cmp     rax, 100000       # ivtmp.51,
    jne     .L31      #,

Since you want to write a separate destination, there's no way to get the loop down to 4 fused-domain uops so it can run at one vector per clock without unrolling. (Both the load and the store need to be one-register addressing modes, so the trick of using src - dst indexed by current_dst instead of incrementing src doesn't work).

Modifying your C++ so gcc would use a pointer increment would only save one uop, because you have to increment src and dst. i.e. float *endp = start + length; and for (p = start ; p < endp ; p+=4) {} would make a loop like

.loop:
    rsqrtps  xmm0, [rsi]
    add      rsi, 16
    movaps   [rdi], xmm0
    add      rdi, 16
    cmp      rdi, rbx
    jne      .loop

Hopefully gcc will do something like this while unrolling; otherwise the rsqrtps + movaps will be 4 fused-domain uops on their own if they still use an indexed addressing mode, and no amount of unrolling will make your loop run at one vector per clock.
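
If you'd rather unroll by hand in the source instead of relying on gcc, a sketch (assuming `size` is a whole multiple of 16 floats) with pointer increments and a 4x unroll:

// 4 vectors (16 floats) per iteration; one-register addressing intended.
// Check the generated asm: the compiler may still choose indexed modes.
for (float *src = ar_in, *dst = ar_out; src != ar_in + size; src += 16, dst += 16) {
    _mm_store_ps(dst +  0, _mm_rsqrt_ps(_mm_load_ps(src +  0)));
    _mm_store_ps(dst +  4, _mm_rsqrt_ps(_mm_load_ps(src +  4)));
    _mm_store_ps(dst +  8, _mm_rsqrt_ps(_mm_load_ps(src +  8)));
    _mm_store_ps(dst + 12, _mm_rsqrt_ps(_mm_load_ps(src + 12)));
}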

Peter Cordes
  • One interesting thing: According to Agner Fog's instruction tables, you're better off using scalar `rsqrtss` for throughput on Intel Atom, which is hilarious if true. – EOF Jul 28 '16 at 07:49

Because this is a streaming calculation with very low arithmetic intensity, you are almost certainly memory bound. You will likely find a measurable speedup if you use non-temporal loads and stores.

// Hint to the CPU that we don't want to keep the input in the cache
// (prefetch the next vector, one ahead of the current load).
// Be sure to update this if you use manual loop unrolling.
_mm_prefetch(reinterpret_cast<char*>(ar_in_now+4), _MM_HINT_NTA);

// Hint to the CPU that we don't need to write back through the cache
_mm_stream_ps(ar_out_now,xmm0);
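
Dropped into the question's inner loop, that would look roughly like this (a sketch; the one-vector prefetch distance comes from the snippet above and may need tuning):

ar_out_now = ar_out;
for (ar_in_now = ar_in; ar_in_now != ar_in_end; ar_in_now += 4, ar_out_now += 4) {
    _mm_prefetch(reinterpret_cast<char*>(ar_in_now + 4), _MM_HINT_NTA);
    xmm0 = _mm_rsqrt_ps(_mm_load_ps(ar_in_now));
    _mm_stream_ps(ar_out_now, xmm0);   // non-temporal store, bypasses the cache
}
_mm_sfence();   // order the NT stores before other threads read ar_out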

EDIT:

I ran some tests to see what things look like on different hardware. Unsurprisingly, the results are quite sensitive to the hardware in use. I would posit that the differences are likely due to the increased number of read/write buffers in modern CPUs.

All code was compiled using gcc-6.1 with

gcc -std=c++14 -O3 -march=native -mavx -mfpmath=sse -ffast-math

Intel Core i3-3120M @ 2.5GHz; 3MB cache

OP's code:               350 +- 10 milliseconds
NTA Prefetch:            360 +- 5 milliseconds
NTA Prefetch+NTA store:  430 +- 10 milliseconds

Intel Core i7-6500U CPU @ 2.50GHz; 4MB cache

OP's code:               205 +- 5 milliseconds
NTA Prefetch:            220 +- 2 milliseconds
NTA Prefetch+NTA store:  200 +- 5 milliseconds

When the problem size is increased to 2MB, NTA Prefetch+NTA store sees a ~30% decrease in runtime over the OP's solution.

Results: The problem size is too small to substantially benefit from NTA. On older architectures, it is detrimental. On newer architectures, it is on par with the OP's solution.

Conclusion: Probably not worth the extra effort in this case.

Tim
  • Mmm... Manual prefetch for linear RAM access pattern doesn't make sense, though again I would like to take a look at measurements – Severin Pappadeux Jul 28 '16 at 00:54
  • It's less about the act of prefetching and more about hinting to the CPU to keep data out of the cache. If the hint is used, the CPU won't do any evictions. – Tim Jul 28 '16 at 01:29
  • Does `prefetchNTA` really make a difference on WB memory on real hardware? I haven't tested it, but my guess is that they could insert newly-loaded lines at the LRU position instead of the MRU position of set-associative caches. See [the NT loads section of my answer here](http://stackoverflow.com/questions/35516878/acquire-release-semantics-with-non-temporal-stores-on-x64) . Also, note that the OP's problem size of 6250 * 16 floats * 4B/float is only 2/3rds of L2 cache size. Bypassing cache is probably a bad idea here. – Peter Cordes Jul 28 '16 at 03:53
  • @PeterCordes See my updated post for some numbers. At such a small problem size, it isn't really beneficial and (unsurprisingly) depends on the hardware used. – Tim Jul 28 '16 at 15:09
  • i3-3120M is an IvyBridge, which has [a performance bug with prefetch instruction throughput](http://www.agner.org/optimize/blog/read.php?i=285) that isn't present in SnB or HSW. See also [Mysticial's confirmation of avoiding prefetch on IvB](http://stackoverflow.com/questions/32103968/non-temporal-loads-and-the-hardware-prefetcher-do-they-work-together#comment52133238_32118466). You'd probably have gotten similar results on SnB or even Nehalem as on your Haswell. – Peter Cordes Jul 28 '16 at 15:35
  • @PeterCordes I was unaware of the prefetch bug. That's good to know! – Tim Jul 28 '16 at 16:14

You have to measure it, of course, but there is well-known code to compute a (not very precise) inverse square root; see https://www.beyond3d.com/content/articles/8/

float InvSqrt (float x) {
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
}

Here is a conversion to SSE2, tested with VS2015 and GCC 5.4.0 (link):

__m128 InvSqrtSSE2(__m128 x) {
    __m128 xhalf = _mm_mul_ps(x, _mm_set1_ps(0.5f));

    x = _mm_castsi128_ps(_mm_sub_epi32(_mm_set1_epi32(0x5f3759df), _mm_srai_epi32(_mm_castps_si128(x), 1)));

    return _mm_mul_ps(x, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(xhalf, _mm_mul_ps(x, x))));
}
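
For reference, a usage sketch over the question's arrays (not part of the original answer), with an optional second Newton-Raphson step that, as noted in the comments below, is essentially a copy of the last line:

// size must be a multiple of 4; both pointers 16-byte aligned.
for (size_t i = 0; i < size; i += 4) {
    __m128 x = _mm_load_ps(ar_in + i);
    __m128 r = InvSqrtSSE2(x);
    // Optional extra Newton-Raphson step for a little more accuracy:
    __m128 xhalf = _mm_mul_ps(x, _mm_set1_ps(0.5f));
    r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(xhalf, _mm_mul_ps(r, r))));
    _mm_store_ps(ar_out + i, r);
}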

UPDATE I

Mea culpa! Thanks to @EOF, who noticed the improper conversions; they have been replaced with casts.

Severin Pappadeux
  • There's no way that can be faster than `rsqrtps`, the two multiplications alone are already slower – harold Jul 27 '16 at 21:23
  • @harold I suspect that as well, but would be interested in measurements – Severin Pappadeux Jul 27 '16 at 21:28
  • You realize that a) the code you've cited has undefined behavior in C, and b) the point of the undefined behavior is that it tries to interpret the bitwise-representation of the `float` argument as `int`, in sharp contrast to your proposed code which converts the value? – EOF Jul 27 '16 at 22:43
  • @EOF `the code you've cited has undefined behavior in C` - sure, but we're talking about a very specific platform, where it might work (famous last words). `in sharp contrast to your proposed code which converts the value` - aha, you're right. Updated the answer with the cast intrinsics. – Severin Pappadeux Jul 28 '16 at 00:48
  • re: undefined behaviour: use unions for type-punning in GNU C, not pointer-casts. pointer-casts really do break, and unions are guaranteed to work in gcc, IIRC. – Peter Cordes Jul 28 '16 at 02:46
  • How many bits of precision does this have, before and after the Newton-Raphson iteration? [`rsqrtps` has an error of about +/-12 bits](http://www.felixcloutier.com/x86/RSQRTPS.html). See also http://stackoverflow.com/a/1528751/224132. – Peter Cordes Jul 28 '16 at 02:49
  • @PeterCordes take a look at the graph in the "Accuracy" section of https://en.wikipedia.org/wiki/Fast_inverse_square_root – Severin Pappadeux Jul 28 '16 at 13:49
  • So according to that, the relative error is on the order of 0.175% after one Newton-Raphson iteration. `log(100/0.175)/log(2) = 9.16` bits. So after an NR iteration, this is only a couple bits more accurate than raw rsqrtps (which only requires SSE). `rsqrtps` + an NR iteration is probably higher throughput than this with the same NR iteration. But for latency, this is neat if 9 bits of relative error is ok but raw `rsqrt` isn't. `psrad` -> `psubd` -> 1 cycle bypass delay from int-vec to fp-vec is two cycles shorter than the 5-cycle latency of `rsqrtps` (pre-Skylake), but more uops. – Peter Cordes Jul 28 '16 at 14:14
  • @PeterCordes One could add another NR iteration for more accuracy; it is pretty much a copy of the last line. Yes, more uops, and I suspect it will be slower than `_mm_rsqrt_ps`, but we're talking about performance, and measurement is king. – Severin Pappadeux Jul 28 '16 at 14:53
  • @PeterCordes wrt UB, in C yes, a union might be better. With SSE2 there is a cast operator which does not map to any instruction/uops - basically, the FP vector is reinterpreted as an int vector and vice versa – Severin Pappadeux Jul 28 '16 at 15:01
  • For simple sequences that aren't memory bound, static analysis can often predict performance pretty accurately, especially on Intel SnB-family CPUs that usually have few weird slowdowns for code that bottlenecks on execution units or fused-domain uop throughput. See [Agner Fog's microarch pdf](http://agner.org/optimize/). Also, Intel's IACA tool is useful. Doing another two NR iterations would still only be as accurate as `rsqrtps` + one NR iteration, so that's obviously not worth considering (unless you use double-precision floats, to get more than 28 bits of mantissa). – Peter Cordes Jul 28 '16 at 15:01
  • And yes, intrinsics like `castps_si128` are the way to go for manual vectorization. I was only talking about type-punning the pure C `int` and `float` types. – Peter Cordes Jul 28 '16 at 15:05
  • BTW, I think you used `_mm_add_epi32` where the C source has a `-`. I didn't really read it carefully, though, so I'll just leave a comment instead of editing myself. – Peter Cordes Jul 28 '16 at 15:09
  • @PeterCordes shouldn't put out code on the road without testing. Anyway, updated with code tested with VS2015 against Quake version and 1/sqrtf(). Still interested in performance checks though... – Severin Pappadeux Jul 28 '16 at 19:18