
GCC will only vectorize the following loop if I tell it to use associative math, e.g. with -Ofast.

float sumf(float *x)
{
  x = (float*)__builtin_assume_aligned(x, 64);
  float sum = 0;
  for(int i=0; i<2048; i++) sum += x[i];
  return sum;
}

Here is the assembly with -Ofast -mavx.

sumf(float*):
    vxorps  %xmm0, %xmm0, %xmm0
    leaq    8192(%rdi), %rax
.L2:
    vaddps  (%rdi), %ymm0, %ymm0
    addq    $32, %rdi
    cmpq    %rdi, %rax
    jne .L2
    vhaddps %ymm0, %ymm0, %ymm0
    vhaddps %ymm0, %ymm0, %ymm1
    vperm2f128  $1, %ymm1, %ymm1, %ymm0
    vaddps  %ymm1, %ymm0, %ymm0
    vzeroupper
    ret

This clearly shows the loop has been vectorized.

But this loop also has a dependency chain. To overcome the latency of the addition I need to unroll and keep independent partial sums: at least three on most x86_64 cores, eight on Skylake, and ten if the addition is done with FMA instructions on Haswell and Broadwell. As far as I understand, I can unroll the loop with -funroll-loops.

Here is the assembly with -Ofast -mavx -funroll-loops.

sumf(float*):
    vxorps  %xmm7, %xmm7, %xmm7
    leaq    8192(%rdi), %rax
.L2:
    vaddps  (%rdi), %ymm7, %ymm0
    addq    $256, %rdi
    vaddps  -224(%rdi), %ymm0, %ymm1
    vaddps  -192(%rdi), %ymm1, %ymm2
    vaddps  -160(%rdi), %ymm2, %ymm3
    vaddps  -128(%rdi), %ymm3, %ymm4
    vaddps  -96(%rdi), %ymm4, %ymm5
    vaddps  -64(%rdi), %ymm5, %ymm6
    vaddps  -32(%rdi), %ymm6, %ymm7
    cmpq    %rdi, %rax
    jne .L2
    vhaddps %ymm7, %ymm7, %ymm8
    vhaddps %ymm8, %ymm8, %ymm9
    vperm2f128  $1, %ymm9, %ymm9, %ymm10
    vaddps  %ymm9, %ymm10, %ymm0
    vzeroupper
    ret

GCC does unroll the loop eight times. However, it does not do independent sums. It does eight dependent sums. That's pointless and no better than without unrolling.
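
To be clear, what I want the compiler to generate is conceptually something like the following scalar sketch with four independent accumulators (sumf_unrolled and the choice of four are just for illustration):

float sumf_unrolled(float *x)
{
  /* Four independent partial sums break the single dependency chain,
     so consecutive additions can overlap in the pipeline. */
  float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for (int i = 0; i < 2048; i += 4) {
    s0 += x[i+0];
    s1 += x[i+1];
    s2 += x[i+2];
    s3 += x[i+3];
  }
  return (s0 + s1) + (s2 + s3);
}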

How can I get GCC to unroll the loop and do independent partial sums?


Edit:

Clang unrolls to four independent partial sums even without -funroll-loops, at least for SSE, though I am not sure its AVX code is as efficient. The compiler should not need -funroll-loops with -Ofast anyway, so it's good to see that Clang gets this right, at least for SSE.

Clang 3.5.1 with -Ofast.

sumf(float*):                              # @sumf(float*)
    xorps   %xmm0, %xmm0
    xorl    %eax, %eax
    xorps   %xmm1, %xmm1
.LBB0_1:                                # %vector.body
    movups  (%rdi,%rax,4), %xmm2
    movups  16(%rdi,%rax,4), %xmm3
    addps   %xmm0, %xmm2
    addps   %xmm1, %xmm3
    movups  32(%rdi,%rax,4), %xmm0
    movups  48(%rdi,%rax,4), %xmm1
    addps   %xmm2, %xmm0
    addps   %xmm3, %xmm1
    addq    $16, %rax
    cmpq    $2048, %rax             # imm = 0x800
    jne .LBB0_1
    addps   %xmm0, %xmm1
    movaps  %xmm1, %xmm2
    movhlps %xmm2, %xmm2            # xmm2 = xmm2[1,1]
    addps   %xmm1, %xmm2
    pshufd  $1, %xmm2, %xmm0        # xmm0 = xmm2[1,0,0,0]
    addps   %xmm2, %xmm0
    retq

ICC 13.0.1 with -O3 unrolls to two independent partial sums. ICC apparently assumes associative math with only -O3.

.B1.8: 
    vaddps    (%rdi,%rdx,4), %ymm1, %ymm1                   #5.29
    vaddps    32(%rdi,%rdx,4), %ymm0, %ymm0                 #5.29
    vaddps    64(%rdi,%rdx,4), %ymm1, %ymm1                 #5.29
    vaddps    96(%rdi,%rdx,4), %ymm0, %ymm0                 #5.29
    addq      $32, %rdx                                     #5.3
    cmpq      %rax, %rdx                                    #5.3
    jb        ..B1.8        # Prob 99%                      #5.3
Z boson
  • manually add 8 accumulators? – user3528438 Oct 09 '15 at 13:47
  • @user3528438, that defeats the whole purpose of having the compiler do this for me. I would only unroll four times anyway, and if I have to unroll by hand I might as well use intrinsics (which is what I would do in practice anyway). ICC incidentally unrolls to two partial sums. ICC is better. – Z boson Oct 09 '15 at 13:52
  • I tried with `#pragma omp simd reduction(+:sum) aligned(x:64)` and `-fopenmp`. That definitely did something more but I can't read enough of assembly to tell whether it fixed your issue. Can you? – Gilles Oct 09 '15 at 13:57
  • To be fair, I think you're asking too much of the compiler. Give it a few more years maybe? – Mysticial Oct 09 '15 at 14:06
  • @Gilles, I'll look into `omp simd`. I have never used it because I did not see the point of it since the compiler already implements SIMD with the right flags. But apparently it does not do as well as I thought. If `omp simd` does independent partial sums then I would have a reason to use it. – Z boson Oct 09 '15 at 14:17
  • Cool. Please let me know if it works as I'm really curious about that, and unfortunately, I'm not a great assembly reader. – Gilles Oct 09 '15 at 14:21
  • @Gilles, speaking of which I have always wanted something like '#pragma unroll 4'. It's annoying to unroll by hand every time. That's one reason assembly is awesome. I did that with NASM [here](http://stackoverflow.com/questions/25899395/obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62) using a macro `%rep unroll`. – Z boson Oct 09 '15 at 14:22
  • @Gilles, I tried `omp simd` (a sketch of what I tried is below these comments) but it made no difference. However, I did notice that I did not have to use `-Ofast`. As far as I can tell `omp simd` with `-O2` is the same as just using `-Ofast`. – Z boson Oct 09 '15 at 14:31
  • @Mysticial, or I could just use Clang. I updated my answer with the assembly from Clang and it does the right thing at least for SSE. – Z boson Oct 09 '15 at 15:03
  • Those unaligned loads in the clang version tho. – EOF Oct 09 '15 at 15:04
  • @EOF Unaligned load/stores have no penalty for aligned addresses since Nehalem. – Mysticial Oct 09 '15 at 15:10
  • @EOF, unaligned loads are not really a problem since Nehalem. But anyway I am not very familiar with Clang. Maybe there is a way to tell it to assume the array is aligned. I'm intrigued with Clang now but unfortunately even though it has OpenMP support now [it does not compete with GCC](http://www.phoronix.com/scan.php?page=article&item=llvm_clang_openmp&num=2). I can fix the unrolling myself in GCC but I can't easily replace OpenMP. – Z boson Oct 09 '15 at 15:10
  • @Mysticial But the unaligned load *does* preclude just using a memory operand (though it seems `-march=core-avx2` allows unaligned memory operands to `vaddps`). – EOF Oct 09 '15 at 15:13
  • @EOF: AVX memory operands don't fault on unaligned. Only `vmovdqa` (and other explicitly-aligned load/store AVX insns) fault on unaligned addresses. – Peter Cordes Oct 09 '15 at 16:31
  • @Zboson: Skylake has 4-cycle latency `vaddps`, with a throughput of 2 per cycle. (It drops the 3c FP add unit, and uses the 4-cycle latency FMA units for adds as well as multiplies.) You need 8 vector accumulators to saturate Skylake's FP throughput for add, mul, or fma. I completely agree that it would be really nice if compiler unrolling was smarter about using more accumulators. [clang 3.7 on godbolt](https://goo.gl/YGgN8x) uses 4, but pointlessly unrolls more than that. (uop caches are small, so unroll only as much as needed. gcc only unrolls by default with `-fprofile-use`.) – Peter Cordes Oct 09 '15 at 16:38
  • @PeterCordes, thanks for the info on Skylake. I was not aware of that. Where do you get your Skylake info? I could also have used fma (`a+b=a*1.0+b`) with Haswell and Broadwell. Then I would need to unroll 10 times but it would double my throughput. I'll fix my question shortly. – Z boson Oct 12 '15 at 08:25
  • @PeterCordes, your code at godbolt is really interesting. Clang does better than the other compilers. But I don't think Clang supports `omp simd`. That's from OpenMP 4.0 and Clang does not support that yet as far as I know. It should warn about that with `-Wall`, but it does not for me; it gives no warning even if I use `omp asdf`. In any case the binaries are identical with and without it. – Z boson Oct 12 '15 at 08:33
  • @PeterCordes, incidentally, are you interested in Skylake? I was hugely disappointed in Intel dropping AVX512. My real-time ray tracer, which I wrote in OpenCL for my six-year-old GTX580, still beats every Intel processor I have used by a good margin, and that was before I knew much about optimization. Intel only cares about IoT and low power now for the consumer market. They only want to sell scooters now... – Z boson Oct 12 '15 at 08:35
  • @Zboson: I saw the news months before launch that AVX512 wasn't going to be in the desktop models. I was disappointed that not all Skylake Xeons will have AVX512, though. Only Skylake-EP cores will, not Xeons based on the quad-core desktop silicon. What I'm disappointed about recently is that they're not making any socketed desktop Skylake chips with eDRAM. So Broadwell-C beats Skylake on some benchmarks, like rar and 7z compression. For some parallelizable workloads, sure GPGPU is the way to go. I'm not that surprised that a GTX580 is still competitive. – Peter Cordes Oct 12 '15 at 16:57
  • @Zboson: Skylake insn lat/tput from an AIDA64 run someone linked on realworldtech. I put links into this answer: http://stackoverflow.com/questions/32000917/c-loop-optimization-help-for-final-assignment/32001196#32001196 and this comment: http://stackoverflow.com/questions/32002277/why-does-gcc-or-clang-not-optimise-reciprocal-to-1-instruction-when-using-fast-m/32002316#comment51941322_32002316 – Peter Cordes Oct 12 '15 at 17:01
  • @PeterCordes, I was not aware AVX512 was only for Skylake-EP. That explains why the mobile Skylake Xeon processors don't have AVX512. – Z boson Oct 13 '15 at 09:31
  • @PeterCordes, you mentioned GCC unrolling only with `-fprofile-use`. I have never used this. Do you have experience with this? http://stackoverflow.com/questions/13881292/gcc-profile-guided-optimization-pgo – Z boson Oct 15 '15 at 10:52
  • @Zboson: not a lot, but it's generally recommended if you can run the binary made with `-fprofile-generate` in a way that exercises all the code paths that should be fast. (i.e. make sure any loops that should be unrolled actually get run with the profile-generation test data set.) x264's Makefile has a profile-generate option that encodes a sample video (which you have to provide) with a few different encoder settings, then builds with `-fprofile-use`. I assume the design goal is to only unroll hot loops, and leave cold loops more compact. And only inline where useful, etc. – Peter Cordes Oct 15 '15 at 23:39
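
For reference, here is roughly what the `omp simd` version discussed in the comments above looks like (a sketch; sumf_omp is just an illustrative name, built with -O2 and -fopenmp, and as noted above it did not give me independent partial sums with GCC):

float sumf_omp(float *x)
{
  float sum = 0;
  /* Tell the compiler the reduction may be reordered and that x is 64-byte aligned. */
  #pragma omp simd reduction(+:sum) aligned(x:64)
  for (int i = 0; i < 2048; i++)
    sum += x[i];
  return sum;
}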

1 Answer

Some use of GCC vector extensions and __builtin_ functions produces this:

#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uint32_t */

typedef float v8sf __attribute__((vector_size(32)));
typedef uint32_t v8u32 __attribute__((vector_size(32)));

static v8sf sumfvhelper1(v8sf arr[4])
{
  v8sf retval = {0};
  for (size_t i = 0; i < 4; i++)
    retval += arr[i];
  return retval;
}

static float sumfvhelper2(v8sf x)
{
  v8sf t = __builtin_shuffle(x, (v8u32){4,5,6,7,0,1,2,3});
  x += t;
  t = __builtin_shuffle(x, (v8u32){2,3,0,1,6,7,4,5});
  x += t;
  t = __builtin_shuffle(x, (v8u32){1,0,3,2,5,4,7,6});
  x += t;
  return x[0];
}

float sumfv(float *x)
{
  //x = __builtin_assume_aligned(x, 64);
  v8sf *vx = (v8sf*)x;
  v8sf sumvv[4] = {{0}};
  for (size_t i = 0; i < 2048/8; i+=4)
    {
      sumvv[0] += vx[i+0];
      sumvv[1] += vx[i+1];
      sumvv[2] += vx[i+2];
      sumvv[3] += vx[i+3];
    }
  v8sf sumv = sumfvhelper1(sumvv);
  return sumfvhelper2(sumv);
}

Which gcc 4.8.4, invoked as gcc -Wall -Wextra -Wpedantic -std=gnu11 -march=native -O3 -fno-signed-zeros -fno-trapping-math -freciprocal-math -ffinite-math-only -fassociative-math -S, turns into:

sumfv:
    vxorps  %xmm2, %xmm2, %xmm2
    xorl    %eax, %eax
    vmovaps %ymm2, %ymm3
    vmovaps %ymm2, %ymm0
    vmovaps %ymm2, %ymm1
.L7:
    addq    $4, %rax
    vaddps  (%rdi), %ymm1, %ymm1
    subq    $-128, %rdi
    vaddps  -96(%rdi), %ymm0, %ymm0
    vaddps  -64(%rdi), %ymm3, %ymm3
    vaddps  -32(%rdi), %ymm2, %ymm2
    cmpq    $256, %rax
    jne .L7
    vaddps  %ymm2, %ymm3, %ymm2
    vaddps  %ymm0, %ymm1, %ymm0
    vaddps  %ymm0, %ymm2, %ymm0
    vperm2f128  $1, %ymm0, %ymm0, %ymm1
    vaddps  %ymm0, %ymm1, %ymm0
    vpermilps   $78, %ymm0, %ymm1
    vaddps  %ymm0, %ymm1, %ymm0
    vpermilps   $177, %ymm0, %ymm1
    vaddps  %ymm0, %ymm1, %ymm0
    vzeroupper
    ret

The second helper function isn't strictly necessary, but summing over the elements of a vector tends to produce terrible code in gcc. If you're willing to use platform-dependent intrinsics, you can probably replace most of it with __builtin_ia32_haddps256().
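
For example, a horizontal-sum helper along these lines (a sketch using the portable _mm256_*/_mm_* intrinsics from immintrin.h rather than the raw __builtin_ia32_* interface; hsum256_ps is just an illustrative name) could stand in for sumfvhelper2():

#include <immintrin.h>

/* Reduce the 8 floats of a __m256 to a single float. */
static float hsum256_ps(__m256 v)
{
  __m128 lo = _mm256_castps256_ps128(v);       /* lower 4 elements */
  __m128 hi = _mm256_extractf128_ps(v, 1);     /* upper 4 elements */
  __m128 s  = _mm_add_ps(lo, hi);              /* 8 -> 4 partial sums */
  s = _mm_add_ps(s, _mm_movehl_ps(s, s));      /* 4 -> 2 */
  s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));  /* 2 -> 1 */
  return _mm_cvtss_f32(s);
}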

EOF