
I have two arrays of floats and I would like to calculate the dot product, using SSE and AVX, with the lowest latency possible. I am aware there is a 256-bit dot-product intrinsic for floats, but I have read on SO (https://stackoverflow.com/a/4121295/997112) that it is slower than the technique below.

I have done most of the work; the vector temp_sum contains all the partial sums. I just need to sum the eight 32-bit floats contained within temp_sum at the end.

#include <xmmintrin.h>
#include <immintrin.h>

int main(){
    const int num_elements_in_array = 16;
    __declspec(align(32)) float x[num_elements_in_array];
    __declspec(align(32)) float y[num_elements_in_array];

    x[0] = 2;   x[1] = 2;   x[2] = 2;   x[3] = 2;
    x[4] = 2;   x[5] = 2;   x[6] = 2;   x[7] = 2;
    x[8] = 2;   x[9] = 2;   x[10] = 2;  x[11] = 2;
    x[12] = 2;  x[13] = 2;  x[14] = 2;  x[15] = 2;

    y[0] = 3;   y[1] = 3;   y[2] = 3;   y[3] = 3;
    y[4] = 3;   y[5] = 3;   y[6] = 3;   y[7] = 3;
    y[8] = 3;   y[9] = 3;   y[10] = 3;  y[11] = 3;
    y[12] = 3;  y[13] = 3;  y[14] = 3;  y[15] = 3;

    __m256 a;
    __m256 b;
    __m256 temp_products;   
    __m256 temp_sum = _mm256_setzero_ps();

    unsigned short j = 0;
    const int sse_data_size = 32;                               // bytes per 256-bit AVX register
    int num_values_to_process = sse_data_size/sizeof(float);    // 8 floats per iteration

    while(j < num_elements_in_array){
        a = _mm256_load_ps(x+j);
        b = _mm256_load_ps(y+j);

        temp_products = _mm256_mul_ps(b, a);
        temp_sum = _mm256_add_ps(temp_sum, temp_products);

        j = j + num_values_to_process;
    }

    // Need to horizontally sum the eight floats in temp_sum into a single value here

}

I am worried that the 256-bit intrinsics I require are not available in AVX1.

  • This is how I would do it http://stackoverflow.com/questions/13879609/horizontal-sum-of-8-packed-32bit-floats/18616679#18616679 – Z boson Apr 22 '14 at 11:27

2 Answers


I would suggest using 128-bit AVX instructions whenever possible. It saves the latency of one cross-domain shuffle (2 cycles on Intel Sandy/Ivy Bridge) and improves efficiency on CPUs which run AVX instructions on 128-bit execution units (currently AMD Bulldozer, Piledriver, Steamroller, and Jaguar):

static inline float _mm256_reduce_add_ps(__m256 x) {
    /* ( x3+x7, x2+x6, x1+x5, x0+x4 ) */
    const __m128 x128 = _mm_add_ps(_mm256_extractf128_ps(x, 1), _mm256_castps256_ps128(x));
    /* ( -, -, x1+x3+x5+x7, x0+x2+x4+x6 ) */
    const __m128 x64 = _mm_add_ps(x128, _mm_movehl_ps(x128, x128));
    /* ( -, -, -, x0+x1+x2+x3+x4+x5+x6+x7 ) */
    const __m128 x32 = _mm_add_ss(x64, _mm_shuffle_ps(x64, x64, 0x55));
    /* Conversion to float is a no-op on x86-64 */
    return _mm_cvtss_f32(x32);
}
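
For context, here is how the helper might be dropped into the question's loop. This is only a sketch under the question's assumptions (32-byte-aligned arrays whose length is a multiple of 8); the function name dot_product_avx is mine, not from either post, and the code must be compiled with AVX enabled (e.g. -mavx on gcc/clang, /arch:AVX on MSVC):

#include <immintrin.h>

/* Multiply-accumulate eight floats per iteration, then reduce once at the end. */
static float dot_product_avx(const float *x, const float *y, int n) {
    __m256 temp_sum = _mm256_setzero_ps();
    for (int j = 0; j < n; j += 8) {
        __m256 a = _mm256_load_ps(x + j);   /* aligned 256-bit loads */
        __m256 b = _mm256_load_ps(y + j);
        temp_sum = _mm256_add_ps(temp_sum, _mm256_mul_ps(a, b));
    }
    return _mm256_reduce_add_ps(temp_sum);  /* helper defined above */
}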
  • Is _mm_cvtf128_f32 correct? I cannot see it in the Intel Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ – user997112 Apr 21 '14 at 03:09
  • Yes, it is supported by all major compilers (`icc`, `gcc`, `clang`, `msvc`) – Marat Dukhan Apr 21 '14 at 03:30
  • Thanks. You said you would suggest using 128-bit AVX instructions whenever possible. I didn't think it would be possible to use 128-bit instructions on 256-bit registers. What is the general rule for when this can be done? – user997112 Apr 21 '14 at 04:20
  • Are you sure it's supported by all compilers? I am using ICC 13 and it doesn't compile, and a Google search doesn't show it either... – user997112 Apr 21 '14 at 04:56
  • You're right, the intrinsic should be called `_mm_cvtss_f32` – Marat Dukhan Apr 21 '14 at 05:12
  • @MaratDukhan Regarding "improves efficiency on CPUs which run AVX instructions on 128-bit execution units": do you mean that on these CPUs it's less efficient to use AVX than SSE? Is this because AVX is more restrictive (it ties two 128-bit lanes together rather than letting them be independent)? – Z boson Apr 22 '14 at 11:23
  • @Zboson On Bulldozer, AVX-256 is often less efficient than AVX-128 due to flaws in the instruction decoder. On other processors AVX is more efficient because it puts less stress on the instruction decoders (often the bottleneck), even though AVX-256 instructions are internally decomposed into 2 micro-operations. – Marat Dukhan Apr 22 '14 at 15:40
  • Won't using SSE instructions in an AVX context introduce quite a large penalty without a vzeroupper? – Pixelchemist Nov 09 '16 at 15:38
  • It would, but 128-bit SSE intrinsics would generate 128-bit AVX instructions rather than SSE instructions, when targeting AVX instruction sets – Marat Dukhan Nov 21 '16 at 03:12

You can emulate a full horizontal add with AVX (i.e. a proper 256-bit version of _mm256_hadd_ps) like this:

#define _mm256_full_hadd_ps(v0, v1) \
        _mm256_hadd_ps(_mm256_permute2f128_ps(v0, v1, 0x20), \
                       _mm256_permute2f128_ps(v0, v1, 0x31))

If you're just working with one input vector then you may be able to simplify this a little.
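
For the single-vector case, one possible simplification along the same lines is sketched below (the helper name is mine, not from this answer); note that on Intel cores vhaddps itself decodes into shuffle and add micro-ops, so it is worth benchmarking against the extract-and-add approach in the other answer:

static inline float hadd_reduce_ps(__m256 v) {
    /* Two in-lane hadds leave each 128-bit lane with its 4-element sum
       replicated in every slot of that lane. */
    __m256 h = _mm256_hadd_ps(v, v);
    h = _mm256_hadd_ps(h, h);
    /* Combine the two lane sums. */
    const __m128 lo = _mm256_castps256_ps128(h);
    const __m128 hi = _mm256_extractf128_ps(h, 1);
    return _mm_cvtss_f32(_mm_add_ss(lo, hi));
}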

  • Thanks for your answer. I am working with just the one vector, so how would this simplify? Is the latency low? – user997112 Apr 21 '14 at 03:12
  • You'd probably want to simplify it in the context of whatever else you're doing at the same time (in this case presumably just a horizontal reduction sum). The above implementation is a generic replacement for the native `_mm256_hadd_ps` which behaves as you might expect for a full 256-bit SIMD implementation (rather than the 2x128-bit SIMD kludge that you get with AVX whenever horizontal operations are involved). It's been tested, and I suggest using it "as is" for now, and consider simplifying/optimising it later only if needed. – Paul R Apr 21 '14 at 03:41