
I want to write fast SIMD code to compute the multiplicative reduction of a complex array. In standard C this is:

#include <complex.h>
complex float f(complex float x[], int n) {
   complex float p = 1.0;
   for (int i = 0; i < n; i++)
      p *= x[i];
   return p;
}

n will be at most 50.

GCC can't auto-vectorize complex multiplication, but since I am happy to assume the GCC compiler, and if I knew I wanted to target SSE3, I could follow How to enable sse3 autovectorization in gcc and write:

typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
  v4sf v;
  float e[4];
} float4;
typedef struct {
  float4 x;
  float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
  return (complex4){a.x.v*b.x.v - a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x[], int n) {
  v4sf one = {1,1,1,1};
  v4sf zero = {0,0,0,0};
  complex4 p = {{one}, {zero}};  // multiplicative identity 1 + 0i
  for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
  return p;
}

This indeed produces fast vectorized assembly code using GCC, although you still need to pad your input to a multiple of 4. The assembly you get is:

.L3:
    vmovaps xmm0, XMMWORD PTR 16[rsi]
    add     rsi, 32
    vmulps  xmm1, xmm0, xmm2
    vmulps  xmm0, xmm0, xmm3
    vfmsubps        xmm1, xmm3, XMMWORD PTR -32[rsi], xmm1
    vmovaps xmm3, xmm1
    vfmaddps        xmm2, xmm2, XMMWORD PTR -32[rsi], xmm0
    cmp     rdx, rsi
    jne     .L3
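
For completeness, here is a rough sketch of the padding and the final cross-lane reduction (assuming the complex4/f4 definitions above; f_via_f4 and the padding scheme are just an illustration). It pads with the multiplicative identity 1 + 0i and multiplies the four lanes together at the end:

#include <complex>
std::complex<float> f_via_f4(const std::complex<float> x[], int n) {
  complex4 buf[(50 + 3) / 4] = {};   // n is at most 50, so at most 13 packets
  int packets = (n + 3) / 4;
  for (int k = 0; k < packets * 4; k++) {
    buf[k / 4].x.e[k % 4] = (k < n) ? x[k].real() : 1.0f;  // pad reals with 1
    buf[k / 4].y.e[k % 4] = (k < n) ? x[k].imag() : 0.0f;  // pad imags with 0
  }
  complex4 p = f4(buf, packets);
  std::complex<float> r(1.0f, 0.0f);
  for (int j = 0; j < 4; j++)        // multiply the 4 SIMD lanes together
    r *= std::complex<float>(p.x.e[j], p.y.e[j]);
  return r;
}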

However, this code is written for one exact SIMD instruction set and is not optimal for, say, AVX2 or AVX512, for which you need to change the code.

How can you write C or C++ code for which gcc will produce optimal code when compiled for any of SSE, AVX2 or AVX512? That is, do you always have to write separate functions by hand for each different width of SIMD register?

Are there any open source libraries that make this easier?

eleanora
  • The first thing would be to check if some macros reflect the compiler settings. – Antti Haapala Jul 25 '17 at 09:21
  • I couldn't get anywhere with GCC, but Clang [autovectorizes](https://godbolt.org/g/z2E5wK) if you help it a bit, using the available vector width. – harold Jul 25 '17 at 20:24
  • If you're looking for a fully generic approach to this that's optimal for all vector sizes, you're not going to get it for a single type like `float4`. Either you make the vector types really large, or you write your code to handle variable sized vectors. – Mysticial Jul 27 '17 at 20:49
  • You will get higher performance by unrolling with multiple accumulators. Regardless of vector width, the asm loop in your question bottlenecks on loop-carried dependency chains (vmulps / vfmaddps have 4-cycle latency on Skylake but 0.5c throughput, so you need to expose enough parallelism for the CPU to keep 8 FMAs in flight to saturate the execution units). Clang usually unrolls with multiple accumulators by default, but gcc doesn't. – Peter Cordes Jul 28 '17 at 00:49
  • @Mysticial I think if I cover SSE, AVX2 and AVX512 that should be enough. Could you explain in more detail how your suggestions would work please? The array x will be at most 50 complex numbers long. – eleanora Jul 28 '17 at 07:36
  • @PeterCordes Does this simply involve unrolling the loop or something more sophisticated? – eleanora Jul 28 '17 at 08:28
  • @eleanora: If the compiler doesn't do it for you, manually unroll the loop and use four different `p` variables. Like `p0=p1=p2=p3 = {one,zero};`. Then in the loop, `p0 = complex4_mul(p0, x[i+0]);` `p1 = complex4_mul(p1, x[i+1]);`, etc. At the end, combine the accumulators together: `p0 = complex4_mul(p0, p1);`, same for 2 and 3, then reduce the final vector of results. (A full sketch follows these comments.) – Peter Cordes Jul 28 '17 at 08:36
  • With GNU C native-vector syntax, you don't need the union. `v4sf` already supports `[]` for accessing elements. (But generally avoid doing that inside inner loops, because it's just as slow as doing it through the union). – Peter Cordes Jul 31 '17 at 02:06
  • @PeterCordes How would the code look without the union? – eleanora Jul 31 '17 at 04:05
  • Everywhere you use `float4`, use `v4sf`. (And then you can clean up all the `.v` in the code using it.) – Peter Cordes Jul 31 '17 at 04:07
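
To make the multiple-accumulator suggestion from the comments concrete, here is a minimal sketch (my own illustration; it assumes the complex4 definitions above and that the packet count n is a multiple of 4):

static complex4 f4_unrolled(complex4 x[], int n) {
  v4sf one = {1,1,1,1};
  v4sf zero = {0,0,0,0};
  // Four independent accumulators hide the 4-cycle FMA latency.
  complex4 p0 = {{one}, {zero}}, p1 = p0, p2 = p0, p3 = p0;
  for (int i = 0; i < n; i += 4) {
    p0 = complex4_mul(p0, x[i + 0]);
    p1 = complex4_mul(p1, x[i + 1]);
    p2 = complex4_mul(p2, x[i + 2]);
    p3 = complex4_mul(p3, x[i + 3]);
  }
  // Combine the accumulators down to one vector of partial products.
  p0 = complex4_mul(p0, p1);
  p2 = complex4_mul(p2, p3);
  return complex4_mul(p0, p2);
}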

3 Answers

Answer 1 (score 12)

Here would be an example using the Eigen library:

#include <Eigen/Core>
std::complex<float> f(const std::complex<float> *x, int n)
{
    return Eigen::VectorXcf::Map(x, n).prod();
}

If you compile this with clang or g++ with SSE or AVX enabled (and -O2), you should get fairly decent machine code. It also works for other architectures like AltiVec or NEON. If you know that the first entry of x is aligned, you can use MapAligned instead of Map.

You get even better code if you happen to know the size of your vector at compile time, using this:

template<int n>
std::complex<float> f(const std::complex<float> *x)
{
    return Eigen::Matrix<std::complex<float>, n, 1>::MapAligned(x).prod();
}
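
For instance, a hypothetical call site where you know at compile time that the input has been padded to, say, 48 elements would be:

std::complex<float> p = f<48>(x);  // x must point to 48 aligned, padded elements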

Note: The functions above directly correspond to the function f of the OP. However, as @PeterCordes pointed out, it is generally bad to store complex numbers interleaved, since this requires lots of shuffling for multiplication. Instead, one should store real and imaginary parts so that they can be loaded one packet at a time.

Edit/Addendum: To implement a structure-of-arrays like complex multiplication, you can actually write something like:

typedef Eigen::Array<float, 8, 1> v8sf; // Eigen::Array allows element-wise standard operations
typedef std::complex<v8sf> complex8;
complex8 prod(const complex8& a, const complex8& b)
{
    return a*b;
}

Or more generic (using C++11):

template<int size, typename Scalar = float> using complexX = std::complex<Eigen::Array<Scalar, size, 1> >;

template<int size>
complexX<size> prod(const complexX<size>& a, const complexX<size>& b)
{
    return a*b;
}

When compiled with -mavx -O2, this compiles to something like this (using g++-5.4):

    vmovaps 32(%rsi), %ymm1
    movq    %rdi, %rax
    vmovaps (%rsi), %ymm0
    vmovaps 32(%rdi), %ymm3
    vmovaps (%rdi), %ymm4
    vmulps  %ymm0, %ymm3, %ymm2
    vmulps  %ymm4, %ymm1, %ymm5
    vmulps  %ymm4, %ymm0, %ymm0
    vmulps  %ymm3, %ymm1, %ymm1
    vaddps  %ymm5, %ymm2, %ymm2
    vsubps  %ymm1, %ymm0, %ymm0
    vmovaps %ymm2, 32(%rdi)
    vmovaps %ymm0, (%rdi)
    vzeroupper
    ret

For reasons not obvious to me, the multiplication is actually hidden in a helper method which is called by the actual method, which itself just moves some memory around -- I don't know why Eigen/gcc does not assume that the arguments are already properly aligned. If I compile the same with clang 3.8.0 (and the same arguments), it is compiled to just:

    vmovaps (%rsi), %ymm0
    vmovaps %ymm0, (%rdi)
    vmovaps 32(%rsi), %ymm0
    vmovaps %ymm0, 32(%rdi)
    vmovaps (%rdi), %ymm1
    vmovaps (%rdx), %ymm2
    vmovaps 32(%rdx), %ymm3
    vmulps  %ymm2, %ymm1, %ymm4
    vmulps  %ymm3, %ymm0, %ymm5
    vsubps  %ymm5, %ymm4, %ymm4
    vmulps  %ymm3, %ymm1, %ymm1
    vmulps  %ymm0, %ymm2, %ymm0
    vaddps  %ymm1, %ymm0, %ymm0
    vmovaps %ymm0, 32(%rdi)
    vmovaps %ymm4, (%rdi)
    movq    %rdi, %rax
    vzeroupper
    retq

Again, the memory movement at the beginning is weird, but at least it is vectorized. For both gcc and clang this gets optimized away when called in a loop, however:

complex8 f8(complex8 x[], int n) {
    if(n==0)
        return complex8(v8sf::Ones(),v8sf::Zero()); // I guess you want p = 1 + 0*i at the beginning?

    complex8 p = x[0];
    for (int i = 1; i < n; i++) p = prod(p, x[i]);
    return p;
}

The difference here is that clang will unroll that loop to 2 multiplications per iteration. On the other hand, gcc will use fused multiply-add instructions when compiled with -mfma.

The f8 function can of course also be generalized to arbitrary dimensions:

template<int size>
complexX<size> fX(complexX<size> x[], int n) {
    using S= typename complexX<size>::value_type;
    if(n==0)
        return complexX<size>(S::Ones(),S::Zero());

    complexX<size> p = x[0];
    for (int i = 1; i < n; i++) p *= x[i];
    return p;
}

And for reducing a complexX<size> to a single std::complex, the following function can be used:

// only works for powers of two
template<int size> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<size>& var) {
    complexX<size/2> a(var.real().template head<size/2>(), var.imag().template head<size/2>());
    complexX<size/2> b(var.real().template tail<size/2>(), var.imag().template tail<size/2>());
    return redux(a*b);
}
template<> EIGEN_ALWAYS_INLINE
std::complex<float> redux(const complexX<1>& var) {
    return std::complex<float>(var.real()[0], var.imag()[0]);
}
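
Putting fX and redux together, a call for packets of 8 would then look like this (product_soa is just an illustrative name):

// Reduce n packets of 8 complex values (SoA form) to a single std::complex.
std::complex<float> product_soa(complexX<8> x[], int n) {
    return redux(fX<8>(x, n));
}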

However, depending on whether I use clang or g++, I get quite different assembler output. Overall, g++ tends to fail to inline loading the input arguments, and clang fails to use FMA operations (YMMV ...). Essentially, you need to inspect the generated assembler code anyway. And more importantly, you should benchmark the code (it is not clear how much impact this routine has on your overall problem).

Also, I want to note that Eigen actually is a linear algebra library; exploiting it for pure portable SIMD code generation is not really what it is designed for.

chtz
  • Suggestion: compile with `-march=haswell` to enable FMA and AVX, if you're actually going to run on Haswell or later. That also enables `-mtune=haswell`, which affects code-gen decisions for things like how to do possibly-unaligned 256b vector loads (`-mtune=generic` loads the two halves separately, `-mtune=haswell` uses `vmovups ymm`) – Peter Cordes Jul 31 '17 at 02:02
  • Your first example vectorizes, but since it stores things with real and imaginary parts interleaved, it has to shuffle. It unfortunately doesn't even manage to use `vfmaddsubps`, instead doing a separate `vmulps` and then `vaddsubps`, so it doesn't use FMA even when compiled with `-march=haswell -ffast-math` (tried gcc7.1 and clang4.0. Clang uses some scalar `vfmaddss` and `vfmsubss` outside the inner loop). – Peter Cordes Jul 31 '17 at 02:18
  • @PeterCordes Yes, the first two examples have to interleave, but they are in fact the only ones similar to the OP's `f` function. The OP's `f4` function not only requires padding the input data, but also requires "manually" interleaving it before calling or storing it in SOA (or "array of structure of packet") form in the first place (which may or may not be a good idea anyway, depending on how the data is generated). – chtz Jul 31 '17 at 06:27
  • My guess is that the vectorization was done "by hand" in Eigen, not automatically by the compiler (at least gcc has a hard time vectorizing anything involving complex multiplication), and they may not have written specialized code for all possible hardware combinations (fma+avx). You may try and submit a patch if you can measure a significant speed-up, or significantly better precision, from using vfmaddsubps. – Marc Glisse Jul 31 '17 at 07:04
  • @MarcGlisse: gcc can usually fuse mul and add intrinsics into an FMA (especially with `-ffast-math`). Can it do this for mul and `_mm256_addsub_ps()`? I didn't look at the asm carefully to see if it was structured in a way that actually could use an addsub FMA. – Peter Cordes Jul 31 '17 at 07:49
  • @chtz: I just wanted to point out that recommending a library function that uses an inherently non-SIMD-friendly storage format is not the best way to start an answer. Especially since you don't say anything about that problem. It makes sense to show it, but only as an example of what you could do if you needed to handle plain `complex` data instead of arranging it into vector-sized chunks like the OP seems to be willing to do. You can use that format throughout an application, even for scalar loops, with the right index calculations. Or go full SOA. – Peter Cordes Jul 31 '17 at 07:53
  • @PeterCordes I am more than happy to use data format that will give me maximum speed. – eleanora Jul 31 '17 at 08:27
  • @PeterCordes ok, I see. I'll elaborate on that once I find time, and also try to elaborate on how to implement the final stage of the reduction. Non-interleaved storage happens inside Eigen for tasks like matrix-matrix multiplication (that happens somewhat on-the-fly during the blocking/un-blocking), but it is indeed not part of the public interface. – chtz Jul 31 '17 at 08:41
  • @chtz Do you know how this compares to armadillo? – eleanora Jul 31 '17 at 10:42
  • @PeterCordes gcc does fusion between * and +- (the intrinsics expand to that) during the GIMPLE phase of optimization. addsub appears as an arbitrary function call at that point. In later RTL passes, fmaddsub appears as an opaque operation (it could be modeled, but we don't do fusion that late). So it would have to be an x86-specific optimization. – Marc Glisse Jul 31 '17 at 11:20
  • @MarcGlisse I am very interested in x86 only optimisations as in practice almost everybody using my code will be on such a system. – eleanora Jul 31 '17 at 12:14
  • @eleanora Regarding armadillo: I never used that, and would not feel qualified to give an opinion based on their documentation alone. – chtz Aug 02 '17 at 22:16
Answer 2 (score 3)

If portability is your main concern, there are many libraries which provide SIMD instructions in their own syntax. Most of them make explicit vectorization simpler and more portable than intrinsics. The recently published UME::SIMD library performs well.

In the UME::SIMD paper, an interface based on Vc is established, named UME::SIMD. It allows the programmer to access SIMD capabilities without the need for extensive knowledge of SIMD ISAs, and provides a simple, flexible and portable abstraction for explicit vectorization without performance losses compared to intrinsics.

Martin
  • Thank you. If portability means being really fast whether the CPU has only SSE, AVX, or AVX512, then that is indeed my goal. Would you be able to show code for my specified problem using this library? I am still not 100% sure what it would look like to be fast for all three. – eleanora Jul 29 '17 at 05:04
  • Taking UME::SIMD, to be honest I don't yet understand what problem it is solving for my question. Don't you still have to specify the number of elements packed in a vector, which leaves the same problem as I had before, doesn't it? – eleanora Jul 29 '17 at 05:58
  • Libraries are the way to go. Over in large real time embedded equipment land (radars, etc) the library most commonly found was/is VSIPL. It was pretty weird to use, but quite effective. These guys https://www.mrcy.com/products/software/multicore_mathpack/ are pretty good, very useful if you have an established code base on their hardware going back decades. Anyway, these ecosystems have some old code that is still in use, just recompiled anew, which saves a fortune in long term capability maintenance, thanks to the longevity of the libraries used in this area. – bazza Jul 30 '17 at 18:34
  • I haven't used them, but in [this](http://dl.acm.org/citation.cfm?id=2568059) article you can see that the library is portable to ARM. – Martin Sep 11 '17 at 08:34
Answer 3 (score 1)

I don't think you have a fully general solution for this. You can increase the vector_size to 32 bytes and widen all the types to 8 elements:

typedef float v8sf __attribute__ ((vector_size (32)));

typedef union {
  v8sf v;
  float e[8];
} float8;
typedef struct {
  float8 x;
  float8 y;
} complex8;
static complex8 complex8_mul(complex8 a, complex8 b) {
  return (complex8){a.x.v*b.x.v - a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}

This will let the compiler generate 256-bit AVX2 code (don't forget to add -mavx2; for AVX512 you would widen further to vector_size (64) and compile with -mavx512f), but it will make your code slightly worse on SSE-only targets by making memory transfers sub-optimal. However, it will certainly not disable SSE vectorization.

You could keep both versions (with 4 and with 8 array elements), switching between them by some compile-time flag as sketched below, but it might be too tedious for little benefit.
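
A sketch of how that switch could be done with GCC's predefined target macros, so that a single source version adapts to whatever -m flags are in effect (the names LANES, vNsf and complexN are made up):

// Pick the widest vector width the compiler is targeting.
#if defined(__AVX512F__)
#define LANES 16      // 64-byte vectors
#elif defined(__AVX__)
#define LANES 8       // 32-byte vectors
#else
#define LANES 4       // 16-byte vectors; SSE2 is baseline on x86-64
#endif

typedef float vNsf __attribute__ ((vector_size (LANES * sizeof(float))));

typedef struct {
  vNsf x;  // real parts
  vNsf y;  // imaginary parts
} complexN;

static complexN complexN_mul(complexN a, complexN b) {
  complexN r = {a.x*b.x - a.y*b.y, a.y*b.x + a.x*b.y};
  return r;
}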

anatolyg
  • This on its own won't do much. You also need to rewrite the code to actually perform the vectorized multiplication. – eleanora Jul 25 '17 at 13:44
  • I was too lazy to write it in full. Updated my answer now. – anatolyg Jul 27 '17 at 07:53
  • -mavx2 is not the right flag for avx512. In practice the array x is of length between 30 and 50 for me if that makes a difference. – eleanora Jul 27 '17 at 13:11
  • I guess it's called `avx512f`. I have used only `avx2` (which is, in fact, 256-bit), so cannot really say exactly which switch to use. – anatolyg Jul 27 '17 at 13:34
  • I'd recommend using either `-march=skylake-avx512` or `-march=knl`, depending on which uarch you're actually targeting. That will set `-mtune=` appropriate for Skylake or Knights Landing, as well as enabling AVX512DQ, AVX512VL, and so on for Skylake. (See https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 for which CPUs support which parts of AVX512 beyond the "foundation" common subset.) – Peter Cordes Jul 28 '17 at 00:45
  • @PeterCordes In practical terms I will use -march=native. The code will be open source. – eleanora Jul 28 '17 at 07:37