
I am new to SSE2 and AVX. I wrote the following code to test the performance of both SSE2 and AVX.

#include <cmath>
#include <iostream>
#include <chrono>
#include <cstdlib>      // aligned_alloc
#include <emmintrin.h>
#include <immintrin.h>

void normal_res(float* __restrict__ a, float* __restrict__ b, float* __restrict__ c, unsigned long N) {
    for (unsigned long n = 0; n < N; n++) {
        c[n] = sqrt(a[n]) + sqrt(b[n]);
    }
}

void normal(float* a, float* b, float* c, unsigned long N) {
    for (unsigned long n = 0; n < N; n++) {
        c[n] = sqrt(a[n]) + sqrt(b[n]);
    }
}

void sse(float* a, float* b, float* c, unsigned long N) {
    __m128* a_ptr = (__m128*)a;
    __m128* b_ptr = (__m128*)b;

    for (unsigned long n = 0; n < N; n+=4, a_ptr++, b_ptr++) {
        __m128 asqrt = _mm_sqrt_ps(*a_ptr);
        __m128 bsqrt = _mm_sqrt_ps(*b_ptr);
        __m128 add_result = _mm_add_ps(asqrt, bsqrt);
        _mm_store_ps(&c[n], add_result);
    }
}

void avx(float* a, float* b, float* c, unsigned long N) {
    __m256* a_ptr = (__m256*)a;
    __m256* b_ptr = (__m256*)b;

    for (unsigned long n = 0; n < N; n+=8, a_ptr++, b_ptr++) {
        __m256 asqrt = _mm256_sqrt_ps(*a_ptr);
        __m256 bsqrt = _mm256_sqrt_ps(*b_ptr);
        __m256 add_result = _mm256_add_ps(asqrt, bsqrt);
        _mm256_store_ps(&c[n], add_result);
    }
}

int main(int argc, char** argv) {
    unsigned long N = 1 << 30;

    auto *a = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
    auto *b = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
    auto *c = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));

    std::chrono::time_point<std::chrono::system_clock> start, end;
    for (unsigned long i = 0; i < N; ++i) {                                                                                                                                                                                   
        a[i] = 3141592.65358;           
        b[i] = 1234567.65358;                                                                                                                                                                            
    }

    start = std::chrono::system_clock::now();   
    for (int i = 0; i < 5; i++)                                                                                                                                                                              
        normal(a, b, c, N);                                                                                                                                                                                                                                                                                                                                                                                                            
    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    std::cout << "normal elapsed time: " << elapsed_seconds.count() / 5 << std::endl;

    start = std::chrono::system_clock::now();     
    for (int i = 0; i < 5; i++)                                                                                                                                                                                                                                                                                                                                                                                         
        normal_res(a, b, c, N);    
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    std::cout << "normal restrict elapsed time: " << elapsed_seconds.count() / 5 << std::endl;                                                                                                                                                                                 

    start = std::chrono::system_clock::now();
    for (int i = 0; i < 5; i++)                                                                                                                                                                                                                                                                                                                                                                                              
        sse(a, b, c, N);    
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    std::cout << "sse elapsed time: " << elapsed_seconds.count() / 5 << std::endl;   

    start = std::chrono::system_clock::now();
    for (int i = 0; i < 5; i++)                                                                                                                                                                                                                                                                                                                                                                                              
        avx(a, b, c, N);    
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    std::cout << "avx elapsed time: " << elapsed_seconds.count() / 5 << std::endl;   
    return 0;            
}

I compile my program with the g++ compiler as follows.

g++ -msse -msse2 -mavx -mavx512f -O2

The results are as follows. It seems that there is no further improvement when I use the wider 256-bit vectors.

normal elapsed time: 10.5311
normal restrict elapsed time: 8.00338
sse elapsed time: 0.995806
avx elapsed time: 0.973302

I have two questions.

  1. Why doesn't AVX give me any further improvement? Is it because of memory bandwidth?
  2. According to my experiment, SSE2 performs 10 times faster than the naive version. Why is that? I expected SSE2 to be only about 4 times faster, since its 128-bit vectors hold four single-precision floats. Thanks a lot.
  • What CPU are you testing on? Why are you using `-mavx512f` instead of setting tune options as well with `-march=native`? Why are you using `-mavx512f` when your program doesn't use any `__m512` vectors? AVX512F doesn't include AVX512VL, so the compiler couldn't use EVEX encodings or ymm16..31 even if it wanted to. – Peter Cordes Mar 01 '20 at 17:14

2 Answers


There are several issues here....

  1. Memory bandwidth is very likely to be important for these array sizes -- more notes below.
  2. Throughput for SSE and AVX square root instructions may not be what you expect on your processor -- more notes below.
  3. The first test ("normal") may be slower than expected because the output array is instantiated (i.e., virtual to physical mappings are created) during the timed part of the test. (Just fill c with zeros in the loop that initializes a and b to fix this.)
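A minimal sketch of the fix for item 3, reusing the initialization loop already in main (only the c[i] line is new):

for (unsigned long i = 0; i < N; ++i) {
    a[i] = 3141592.65358;
    b[i] = 1234567.65358;
    c[i] = 0.0f;   // touch the output array too, so its page faults happen before any timed region
}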

Memory Bandwidth Notes:

  • With N = 1<<30 and float variables, each array is 4GiB.
  • Each test reads two arrays and writes to a third array. This third array must also be read from memory before being overwritten -- this is called a "write allocate" or a "read for ownership".
  • So you are reading 12 GiB and writing 4 GiB in each test. The SSE and AVX tests therefore correspond to ~16 GB/s of DRAM bandwidth, which is near the high end of the range typically seen for single-threaded operation on recent processors.
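As a rough cross-check against the timings above (an estimate, not a measurement): 12 GiB read + 4 GiB written is 16 GiB of DRAM traffic per pass, and 16 GiB / ~1.0 s per SSE or AVX pass ≈ 16 GiB/s.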

Instruction Throughput Notes:

  • The best reference for instruction latency and throughput on x86 processors is "instruction_tables.pdf" from https://www.agner.org/optimize/
  • Agner defines "reciprocal throughput" as the average number of cycles per retired instruction when the processor is given a workload of independent instructions of the same type.
  • As an example, for an Intel Skylake core, the throughput of SSE and AVX SQRT is the same:
  • SQRTPS (xmm) 1/throughput = 3 --> 1 instruction every 3 cycles
  • VSQRTPS (ymm) 1/throughput = 6 --> 1 instruction every 6 cycles
  • Execution time for the square roots is expected to be (1<<31) square roots / 4 square roots per SSE SQRT instruction * 3 cycles per SSE SQRT instruction / 3 GHz = 0.54 seconds (arbitrarily assuming a 3 GHz processor frequency).
  • Expected throughput for the "normal" and "normal_res" cases depends on the specifics of the generated assembly code.
John D McCalpin
  • Turns out `normal` and `normal_res` are using `sqrt(double)`; that exactly accounts for the extra factor of 2 slowdown in `normal_res` (which doesn't suffer from page faults). See my answer. – Peter Cordes Mar 01 '20 at 18:34

Scalar being 10x instead of 4x slower:

You're getting page faults in c[] inside the scalar timed region because that's the first time you're writing it. If you did tests in a different order, whichever one was first would pay that large penalty. That part is a duplicate of this mistake: Why is iterating though `std::vector` faster than iterating though `std::array`? See also Idiomatic way of performance evaluation?

normal pays this cost in its first of the 5 passes over the array. Smaller arrays and a larger repeat count would amortize this even more, but better to memset or otherwise fill your destination first to pre-fault it ahead of the timed region.


normal_res is also scalar but is writing into an already-dirtied c[]. Scalar is 8x slower than SSE instead of the expected 4x.

You used sqrt(double) instead of sqrtf(float) or std::sqrt(float). On Skylake-X, this perfectly accounts for the extra factor of 2 in throughput. Look at the compiler's asm output on the Godbolt compiler explorer (GCC 7.4, assuming the same system as your last question). I used -mavx512f (which implies -mavx and -msse), and no tuning options, to hopefully get about the same code-gen you did. main doesn't inline normal_res, so we can just look at the stand-alone definition for it.

normal_res(float*, float*, float*, unsigned long):
...
        vpxord  zmm2, zmm2, zmm2    # uh oh, 512-bit instruction reduces turbo clocks for the next several microseconds.  Silly compiler
                                    # more recent gcc would just use `vpxor xmm0,xmm0,xmm0`
...
.L5:                              # main loop
        vxorpd  xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, DWORD PTR [rdi+rbx*4]   # convert to double
        vucomisd        xmm2, xmm0
        vsqrtsd xmm1, xmm1, xmm0                           # scalar double sqrt
        ja      .L16
.L3:
        vxorpd  xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, DWORD PTR [rsi+rbx*4]
        vucomisd        xmm2, xmm0
        vsqrtsd xmm3, xmm3, xmm0                    # scalar double sqrt
        ja      .L17
.L4:
        vaddsd  xmm1, xmm1, xmm3                    # scalar double add
        vxorps  xmm4, xmm4, xmm4
        vcvtsd2ss       xmm4, xmm4, xmm1            # could have just converted in-place without zeroing another destination to avoid a false dependency :/
        vmovss  DWORD PTR [rdx+rbx*4], xmm4
        add     rbx, 1
        cmp     rcx, rbx
        jne     .L5

The vpxord zmm only reduces turbo clock for a few milliseconds (I think) at the start of each call to normal and normal_res. It doesn't keep using 512-bit operations so clock speed can jump back up again later. This might partially account for it not being exactly 8x.

The compare / ja is because you didn't use -fno-math-errno so GCC still calls actual sqrt for inputs < 0 to get errno set. It's doing if (!(0 <= tmp)) goto fallback, jumping on 0 > tmp or unordered. "Fortunately" sqrt is slow enough that it's still the only bottleneck. Out-of-order exec of the conversion and compare/branching means the SQRT unit is still kept busy ~100% of the time.
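Expressed as C++, the per-element logic is roughly this (a sketch of the shape of the generated code, not the actual source; __builtin_sqrt stands in for the inlined vsqrtsd, and <cmath> is assumed to be included as in the question):

static double sqrt_keeping_errno(float x) {
    double tmp  = x;                    // vcvtss2sd: promote to double
    double root = __builtin_sqrt(tmp);  // vsqrtsd: hardware sqrt, executed unconditionally
    if (!(0.0 <= tmp))                  // vucomisd + ja: taken for negative or NaN input
        root = sqrt(tmp);               // out-of-line libm call so errno gets set
    return root;
}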

vsqrtsd throughput (6 cycles) is 2x slower than vsqrtss throughput (3 cycles) on Skylake-X, so using double costs a factor of 2 in scalar throughput.

Scalar sqrt on Skylake-X has the same throughput as the corresponding 128-bit ps / pd SIMD version. So 6 cycles per 1 number as a double vs. 3 cycles per 4 floats as a ps vector fully explains the 8x factor.

The extra 8x vs. 10x slowdown for normal was just from page faults.
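A minimal sketch of the scalar function with the double-precision mistake fixed (std::sqrt on a float argument picks the float overload, so it stays in float the whole way):

void normal_res_f(float* __restrict__ a, float* __restrict__ b,
                  float* __restrict__ c, unsigned long N) {
    for (unsigned long n = 0; n < N; n++) {
        c[n] = std::sqrt(a[n]) + std::sqrt(b[n]);   // vsqrtss: float sqrt, no conversion to double
    }
}

Add -fno-math-errno as well and GCC can drop the errno branch entirely (and at -O3 it may even auto-vectorize this loop on its own).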


SSE vs. AVX sqrt throughput

128-bit sqrtps is sufficient to get the full throughput of the SIMD div/sqrt unit; assuming this is a Skylake-server like your last question, it's 256 bits wide but not fully pipelined. The CPU can alternate sending a 128-bit vector into the low or high half to take advantage of the full hardware width even when you're only using 128-bit vectors. See Floating point division vs floating point multiplication (FP div and sqrt run on the same execution unit.)

See also instruction latency/throughput numbers on https://uops.info/, or on https://agner.org/optimize/.

The add/sub/mul/fma execution units are all 512 bits wide and fully pipelined; use them (e.g. to evaluate a 6th-order polynomial or something) if you want something that can scale with vector width. div/sqrt is a special case.
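For instance, a sketch of a small polynomial evaluated with 256-bit FMAs via Horner's rule (the c0..c4 coefficient vectors are placeholders, and this assumes FMA is enabled, e.g. via -mfma or -march=native):

// p(x) = c4*x^4 + c3*x^3 + c2*x^2 + c1*x + c0, evaluated for 8 floats at a time
__m256 poly(__m256 x, __m256 c4, __m256 c3, __m256 c2, __m256 c1, __m256 c0) {
    __m256 r = c4;
    r = _mm256_fmadd_ps(r, x, c3);   // r = r*x + c3
    r = _mm256_fmadd_ps(r, x, c2);
    r = _mm256_fmadd_ps(r, x, c1);
    r = _mm256_fmadd_ps(r, x, c0);
    return r;
}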

You'd expect a benefit from using 256-bit vectors for SQRT only if you had a bottleneck on the front-end (4/clock instruction / uop throughput), or if you were doing a bunch of add/sub/mul/fma work with the vectors as well.

256-bit isn't worse, but it doesn't help when the only computation bottleneck is on the div/sqrt unit's throughput.


See John McCalpin's answer for more details about write-only costing about the same as a read+write, because of RFOs.

With so little computation per memory access, you're probably close to bottlenecking on memory bandwidth again / still. Even if the FP sqrt hardware were wider / faster, your code might not actually run any faster in practice. Instead the core would just spend more time doing nothing while waiting for data to arrive from memory.

It seems you are getting exactly the expected speedup from 128-bit vectors (2x * 4x = 8x), so apparently the __m128 version is not bottlenecked on memory bandwidth either.

2x sqrt per 4 memory accesses is about the same as the a[i] = sqrt(a[i]) (1x sqrt per load + store) you were doing in the code you posted in chat, but you didn't give any numbers for that. That one avoided the page-fault problem because it was rewriting an array in-place after initializing it.

In general rewriting an array in-place is a good idea if you for some reason keep insisting on trying to get a 4x / 8x / 16x SIMD speedup using these insanely huge arrays that won't even fit in L3 cache.
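A sketch of that in-place pattern with the same SSE intrinsics used in the question (assuming a is aligned, as it is with the aligned_alloc call above, and N is a multiple of 4):

for (unsigned long n = 0; n < N; n += 4) {
    __m128 v = _mm_load_ps(&a[n]);
    _mm_store_ps(&a[n], _mm_sqrt_ps(v));   // a[n..n+3] = sqrt(a[n..n+3]): one load + one store per vector
}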


Memory access is pipelined, and overlaps with computation (assuming sequential access so prefetchers can be pulling it in continuously without having to compute the next address): faster computation doesn't speed up overall progress. Cache lines arrive from memory at some fixed maximum bandwidth, with ~12 cache line transfers in flight at once (12 LFBs in Skylake). Or L2 "superqueue" can track more cache lines than that (maybe 16?), so L2 prefetch is reading ahead of where the CPU core is stalled.

As long as your computation can keep up with that rate, making it faster will just leave more cycles of doing nothing before the next cache line arrives.

(The store buffer writing back to L1d and then evicting dirty lines is also happening, but the basic idea of core waiting for memory still works.)


You could think of it like stop-and-go traffic in a car: a gap opens ahead of your car. Closing that gap faster doesn't gain you any average speed; it just means you have to stop sooner.


If you want to see the benefit of AVX and AVX512 over SSE, you'll need smaller arrays (and a higher repeat-count). Or you'll need lots of ALU work per vector, like a polynomial.
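For example, a sketch of how the benchmark parameters might be adjusted to keep the working set cache-resident (the exact numbers here are illustrative, not tuned):

unsigned long N = 1UL << 11;   // 2048 floats = 8 KiB per array, ~24 KiB total: fits in L1d cache
const int repeats = 100000;    // much higher repeat count so the total run time stays measurable

start = std::chrono::system_clock::now();
for (int i = 0; i < repeats; i++)
    avx(a, b, c, N);
end = std::chrono::system_clock::now();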

In many real-world problems, the same data is used repeatedly so caches work. And it's possible to break up your problem into doing multiple things to one block of data while it's hot in cache (or even while loaded in registers), to increase the computational intensity enough to take advantage of the compute vs. memory balance of modern CPUs.

Peter Cordes
  • Good catch on the implicit double precision for C square root. That one still trips me up from time to time.... – John D McCalpin Mar 02 '20 at 00:30
  • @JohnDMcCalpin: Until I looked at the asm, I'd been assuming that C++ had an overloaded `sqrt` like it does for templated functions like std::min. And/or I just didn't look closely enough to even think about that mistake. Ah, apparently `std::sqrt` *is* magic, but plain `::sqrt` is *not*. https://godbolt.org/z/a5CwMa shows the difference. Also, GCC aggressively looks for chances to avoid converting to double and back, e.g. `float tmp = sqrt(x)` optimizes it away. So you can get away with it if you don't use the double return value in a larger expression before rounding back to float. – Peter Cordes Mar 02 '20 at 02:12