3

If I have 8 packed 32-bit floating point numbers (__m256), what's the fastest way to extract the horizontal sum of all 8 elements? Similarly, how to obtain the horizontal maximum and minimum? In other words, what's the best implementation for the following C++ functions?

float sum(__m256 x);  ///< returns sum of all 8 elements
float max(__m256 x);  ///< returns the maximum of all 8 elements
float min(__m256 x);  ///< returns the minimum of all 8 elements
Paul R
Walter
  • [Here is a link](http://stackoverflow.com/questions/10833234/4-horizontal-double-precision-sums-in-one-go-with-avx/10834041#10834041) to a previous question regarding calculation of the horizontal sum of packed `double`s. You should be able to adapt it to your `float` case also. It's most efficient if you have multiple `__m256` elements that you want to calculate the sum of in parallel. – Jason R Dec 14 '12 at 14:03
  • @JasonR sorry, but this doesn't help: it's quite a different problem. – Walter Dec 14 '12 at 15:07
  • How is it quite different? You're going to need to use horizontal adds and permutations in order to line up the terms that you want to add, as shown in the other question. You can use a similar structure for the `min` and `max` operations also. I understand it's not a full answer (hence the comment), but it should get you started. – Jason R Dec 14 '12 at 15:15
  • @JasonR well, yes, it's not completely useless, but there are many similar problems which all use shuffles and permutations in conjunction with horizontal and vertical operations. btw, there is no horizontal min/max, is there? – Walter Dec 14 '12 at 15:51
  • 2
    I don't know of a horizontal min/max operation. One method that can get you both min/max simultaneously is to use an in-register sorting network to sort the elements inside a SIMD register. An algorithm suitable for implementation on `__m128`'s can be found in [this paper](http://webdocs.cs.ualberta.ca/~niewiado/TR07-02.pdf); it takes ~15 instructions or so. The way that the YMM registers are implemented on x86 probably makes the job of sorting a `__m256` harder since you can't cross the 128-bit boundary for the most part. – Jason R Dec 14 '12 at 15:59
  • @PeterCordes I see that you have been using Agner Fog's C++ Vector Class Library. Is this working well? In fact, I have my own similar (but much less extensive) library and wonder whether it would be better to abandon it in favour of Fog's. – Walter Nov 09 '16 at 21:31
  • @Walter: I haven't used it much. I worked on improving it for a while, but haven't got back to cleaning up my changes into nice git commits. It does generally compile to nice asm, though, and seems well designed. I'd certainly recommend it for projects where a GPLed library can be used. (It's the full GPL, not LGPL, so only GPL-compatible projects can use it.) – Peter Cordes Nov 09 '16 at 21:37
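Following up on the comment thread: a minimal, untested sketch of one simpler alternative to a full sorting network, computing the horizontal minimum and maximum together by running two pairwise reductions side by side (the helper name hminmax is made up here):

static inline void hminmax(__m256 x, float* mn, float* mx)
{
    __m128 lo = _mm256_castps256_ps128(x);     // lower 128 bits
    __m128 hi = _mm256_extractf128_ps(x, 1);   // upper 128 bits
    __m128 a  = _mm_min_ps(lo, hi);            // 4 pairwise minima
    __m128 b  = _mm_max_ps(lo, hi);            // 4 pairwise maxima
    a = _mm_min_ps(a, _mm_movehl_ps(a, a));    // 2 pairwise minima
    b = _mm_max_ps(b, _mm_movehl_ps(b, b));    // 2 pairwise maxima
    *mn = _mm_cvtss_f32(_mm_min_ss(a, _mm_shuffle_ps(a, a, 1)));
    *mx = _mm_cvtss_f32(_mm_max_ss(b, _mm_shuffle_ps(b, b, 1)));
}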

4 Answers

7

Quickly jotted down here (and hence untested):

float sum(__m256 x) {
    __m128 hi = _mm256_extractf128_ps(x, 1); // upper 128 bits of x
    __m128 lo = _mm256_extractf128_ps(x, 0); // lower 128 bits of x
    lo = _mm_add_ps(hi, lo);                 // reduce to 4 partial sums
    hi = _mm_movehl_ps(hi, lo);              // move the high pair down
    lo = _mm_add_ps(hi, lo);                 // reduce to 2 partial sums
    hi = _mm_shuffle_ps(lo, lo, 1);          // bring element 1 to position 0
    lo = _mm_add_ss(hi, lo);                 // final sum in element 0
    return _mm_cvtss_f32(lo);
}

For min/max, replace _mm_add_ps and _mm_add_ss with _mm_max_* or _mm_min_*.
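Applying that substitution, the max version would look like this (equally untested):

float max(__m256 x) {
    __m128 hi = _mm256_extractf128_ps(x, 1);
    __m128 lo = _mm256_extractf128_ps(x, 0);
    lo = _mm_max_ps(hi, lo);
    hi = _mm_movehl_ps(hi, lo);
    lo = _mm_max_ps(hi, lo);
    hi = _mm_shuffle_ps(lo, lo, 1);
    lo = _mm_max_ss(hi, lo);
    return _mm_cvtss_f32(lo);
}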

Note that this is a lot of work for a few operations; AVX isn't really intended to do horizontal operations efficiently. If you can batch up this work for multiple vectors, then more efficient solutions are possible.
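For instance, if two horizontal sums are needed, they can share the shuffle work. Here is an untested sketch (the name hsum2 is made up) using the vhaddps idiom from the next answer:

static inline __m128 hsum2(__m256 a, __m256 b)
{
    __m256 t  = _mm256_hadd_ps(a, b);          // [a01,a23,b01,b23 | a45,a67,b45,b67]
    t = _mm256_hadd_ps(t, t);                  // [a0123,b0123,.,. | a4567,b4567,.,.]
    __m128 lo = _mm256_castps256_ps128(t);
    __m128 hi = _mm256_extractf128_ps(t, 1);
    return _mm_add_ps(lo, hi);                 // sum(a) in element 0, sum(b) in element 1
}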

Stephen Canon
5

While Stephen Canon's answer is probably ideal for finding the horizontal maximum/minimum, I think a better solution can be found for the horizontal sum.

float horizontal_add (__m256 a) {
    __m256 t1 = _mm256_hadd_ps(a,a);   // pairwise sums within each 128-bit lane
    __m256 t2 = _mm256_hadd_ps(t1,t1); // lane sums in elements 0 and 4
    __m128 t3 = _mm256_extractf128_ps(t2,1);               // upper lane
    __m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2),t3); // combine the two lanes
    return _mm_cvtss_f32(t4);
}
Z boson
  • 2
    Note that `VHADDPS` has 5 cycle latency on Sandy Bridge/Ivy Bridge, so it's possible that this may actually be less efficient than Stephen Canon's implementation (where all the instructions are typically 1 cycle latency). – Paul R Aug 04 '15 at 14:56
  • 1
    @PaulR, you might be right. But in any case horizontal operations are not something that should be done each iteration of a critical loop anyway. – Z boson Aug 20 '15 at 15:12
  • Sure - I just thought it was worth noting that this might be one of those counter-intuitive cases where fewer instructions might not equate to faster execution. – Paul R Aug 20 '15 at 16:03
  • 2
    @PaulR, I admit that I put too much emphasis on number of instructions in the past. Nowadays I would look at the overall latency and throughput and use something like IACA (and test in my application). But in any case I think I came up with this solution from Agner Fog's VCL (which is how I learned SSE and AVX) initially. If you had to bet on a solution between Agner Fog and Stephen Canon how would you bet? I think I would flip a coin. – Z boson Sep 28 '15 at 09:01
  • I've been wrong too many times to bet on such things. ;-) I tend to lean towards an empirical approach mostly though - why speculate when you can benchmark ? – Paul R Sep 28 '15 at 13:05
  • 2
    @PaulR, congrats on the simd gold tag! – Z boson May 27 '16 at 14:58
  • @PaulR, BTW, I suspect that Agner downvoted my answer a few days ago and upvoted Stephen's. I guess he did not like my joke or at least he made it clear who he would bet on. – Z boson May 27 '16 at 15:09
  • I would think any kind of vote from Agner, be it up or down, would be a badge of honour! (Have an up-vote from me anyway, by way of compensation.) – Paul R May 27 '16 at 15:11
  • @Zboson: That was my downvote, because I think the hadd idiom [is usually the wrong choice](http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026). I found this answer while working on a cleanup of that answer I just linked. It's really hard to say that one way is always better, and it takes a really long time to look into the advantages and disadvantages of different ways of doing the same thing. However in this case, failing to start with an `extract128` makes this obviously worse (e.g. for Bulldozer) for no benefit on others. – Peter Cordes May 27 '16 at 15:12
  • @PeterCordes, in that case you might want to write Agner Fog (again) since I copied his code from the VCL. See the file `vectorf256.h` and the function `horizontal_add` for `Vec8f`. I am aware that `hadd` is bad. In fact this subject was [my first question on SO](https://stackoverflow.com/questions/14967969/efficient-4x4-matrix-vector-multiplication-with-sse-horizontal-add-and-dot-prod). – Z boson May 30 '16 at 08:12
  • 1
    @Zboson: See http://www.agner.org/optimize/vectorclass/read.php?i=124. I made some improvements to integer hsums. – Peter Cordes Jun 03 '16 at 17:21
  • 1
    @PeterCordes, awesome that you are doing this! The VCL will get better because of you. But why optimize the horizontal operations? Horizontal operations in a critical loop are usually an indication of an inefficient implementation. I think optimizing 64-bit mult makes more sense because I could imagine needing that. Oh and congrats on the x86 gold tag! – Z boson Jun 04 '16 at 09:01
  • 1
    @Zboson: I'm optimizing hsums because they were the first thing I saw that was obviously sub-optimal. I haven't looked at the other complicated functions yet (like the template meta-programming to choose shuffles). And yeah, looking forward to wielding the dupe-hammer to good effect. Now it will actually feel like it's worth searching for duplicates to bad questions. – Peter Cordes Jun 04 '16 at 14:03
3

I tried to write code that avoids mixing AVX and non-AVX instructions. The horizontal sum of an AVX register of floats can be computed AVX-only with

  • 1x vperm2f128,
  • 2x vshufps and
  • 3x vaddps,

resulting in a register where all entries contain the sum of all elements in the original register.

// permute
//  4, 5, 6, 7, 0, 1, 2, 3
// add
//  0+4, 1+5, 2+6, 3+7, 4+0, 5+1, 6+2, 7+3
// shuffle
//  1+5, 0+4, 3+7, 2+6, 5+1, 4+0, 7+3, 6+2
// add
//  1+5+0+4, 0+4+1+5, 3+7+2+6, 2+6+3+7, 
//  5+1+4+0, 4+0+5+1, 7+3+6+2, 6+2+7+3
// shuffle
//  3+7+2+6, 2+6+3+7, 1+5+0+4, 0+4+1+5, 
//  7+3+6+2, 6+2+7+3, 5+1+4+0, 4+0+5+1
// add
//  3+7+2+6+1+5+0+4, 2+6+3+7+0+4+1+5, 1+5+0+4+3+7+2+6, 0+4+1+5+2+6+3+7,
//  7+3+6+2+5+1+4+0, 6+2+7+3+4+0+5+1, 5+1+4+0+7+3+6+2, 4+0+5+1+6+2+7+3

static inline __m256 hsums(__m256 const& v)
{
    auto x = _mm256_permute2f128_ps(v, v, 1);
    auto y = _mm256_add_ps(v, x);
    x = _mm256_shuffle_ps(y, y, _MM_SHUFFLE(2, 3, 0, 1));
    x = _mm256_add_ps(x, y);
    y = _mm256_shuffle_ps(x, x, _MM_SHUFFLE(1, 0, 3, 2));
    return _mm256_add_ps(x, y);
}

Obtaining the scalar value is then easy using _mm256_castps256_ps128 and _mm_cvtss_f32:

static inline float hadd(__m256 const& v)
{
    return _mm_cvtss_f32(_mm256_castps256_ps128(hsums(v)));
}

I did some basic benchmarks against the other solutions using __rdtscp and did not find one to be superior in terms of mean CPU cycle count on my Intel i5-2500K.

Looking at Agner Fog's instruction tables, I found (for Sandy Bridge processors):

                µops    lat.    1/tp    count

this:

vperm2f128      1       2       1       1
vaddps          1       3       1       3
vshufps         1       1       1       2

sum             6       13      6       6

Z boson:

vhaddps         3       5       2       2
vextractf128    1       2       1       1
addss           1       3       1       1

sum             8       15      6       4

Stephen Canon:

vextractf128    1       2       1       1
addps           1       3       1       2
movhlps         1       1       1       1
shufps          1       1       1       1
addss           1       3       1       1

sum             6       13      6       6

To me, none of these is clearly superior: the values are rather similar, and I cannot foresee whether instruction count, µop count, latency, or throughput matters most.

Edit: The potential problem I assumed to exist in the following is not true. I suspected that, if having the result in the YMM register is sufficient, my hsums could be useful since it doesn't require vzeroupper to prevent a state-switching penalty and can thus interleave / execute concurrently with other AVX computations using different registers without introducing some kind of sequence point.

Pixelchemist
  • 1
    `__m128` intrinsics still use the AVX 3-operand VEX-coded version when you compile with AVX support enabled. You're right that the ABI requires a *stand-alone* version of `float hsum(__m256)` to include a VZEROUPPER, but you always want it to inline anyway. In the SysV ABI, all XMM/YMM/ZMM regs are call-clobbered, so a caller would have to spill everything, regardless of the function returning __m256 or float. (And Windows only has a few call-preserved XMM regs, and that's only the low halves, no call-preserved YMM regs.) – Peter Cordes Nov 09 '16 at 19:51
  • @PeterCordes: Would the spill happen despite inlining? – Pixelchemist Nov 09 '16 at 19:54
  • No, that's one of the major benefits of inlining! – Peter Cordes Nov 09 '16 at 19:54
  • In the count for Stephen Canon's answer, you missed the VEXTRACTF128 of the upper half. Both your functions should be equivalent: one lane crossing shuffle and two in-lane shuffles, and 3 FP adds. Except that Stephen's will run faster on AMD Bulldozer-family, or other CPUs with only 128b execution units (so `vaddps ymm, ymm, ymm` is slower than `vaddps xmm, xmm, xmm`). – Peter Cordes Nov 09 '16 at 19:59
  • See also [my hsums answer](http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86), where the AVX section uses vextractf128, vmovshdup, and vmovhlps, which is equivalent to Stephen's but saves an instruction byte, because those shuffles don't need an imm8 control operand. – Peter Cordes Nov 09 '16 at 20:00
  • Your way is the right approach if it's useful to have the hsum broadcast to every element of a `__m256`, though. Sometimes that might be useful, and doing it this way is definitely better than using a separate VBROADCASTSS (especially without AVX2, since the AVX1 version only exists as a load). On a CPU where YMM ops decode to 2 m-ops / uops (like Bulldozer), it might be worth looking for an alternative. Like maybe producing the result broadcast across a `__m128`, then duplicating that to the upper lane. – Peter Cordes Nov 09 '16 at 20:03
  • @PeterCordes: Many thanks for your enlightening comments. – Pixelchemist Nov 09 '16 at 20:08
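For reference, an untested transcription of the variant Peter Cordes describes above, which replaces the imm8 shuffles with vmovshdup and vmovhlps:

float hsum256_ps_avx(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);   // lower 128 bits, no instruction
    __m128 hi = _mm256_extractf128_ps(v, 1); // vextractf128: upper 128 bits
    lo = _mm_add_ps(lo, hi);                 // reduce to 4 partial sums
    __m128 shuf = _mm_movehdup_ps(lo);       // vmovshdup: [1, 1, 3, 3]
    __m128 sums = _mm_add_ps(lo, shuf);      // [0+1, ., 2+3, .]
    shuf = _mm_movehl_ps(shuf, sums);        // vmovhlps: move 2+3 down to element 0
    sums = _mm_add_ss(sums, shuf);           // element 0 = 0+1+2+3
    return _mm_cvtss_f32(sums);
}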
-1
union ymm {
    __m256 m256;
    struct {
        __m128 m128lo;
        __m128 m128hi;
    };
};

union ymm result = {1,2,3,4,5,6,7,8};
__m256 a = {9,10,11,12,13,14,15,16};

result.m256 = _mm256_add_ps (result.m256, a);               // [10,12,14,16,18,20,22,24]
result.m128lo = _mm_hadd_ps (result.m128lo, result.m128hi); // [22,30,38,46]
result.m128lo = _mm_hadd_ps (result.m128lo, result.m128hi); // [52,84,38,46]
result.m128lo = _mm_hadd_ps (result.m128lo, result.m128hi); // [136,84,38,46]: element 0 holds the sum of all 16 inputs
  • 4
    please add explanation and detail how you figure this would be the answer to the question – njzk2 Jun 03 '14 at 19:55