
I have some C++ code that I am optimizing with IPOPT. Although the cost function, constraint functions, etc. are written in C++, the code was originally written to use the C interface. I haven't bothered to change that yet, and won't unless it turns out to be the issue.

Anyway... We are observing some unexpected behavior where the optimizer converges differently when we compile the program with/without vectorization flags. Specifically, in the CMakeLists file, we have

set(CMAKE_CXX_FLAGS "-Wall -mavx -mfma")

When we run the optimizer with these settings, it converges in approximately 100 iterations. So far, so good.

However, we have reason to believe that when compiled for ARM (Android specifically), no vectorization is occurring, because the performance is drastically different from what we see on an Intel processor. The Eigen documentation says that NEON instructions should always be enabled for 64-bit ARM, but we have reason to suspect that this is not happening. Anyway, that is not the question here.

Due to this suspicion, we wanted to see how bad the performance would be on our Intel processor if we disabled vectorization. This should give us some indication of how much vectorization is occurring, and how much improvement we might expect to see on ARM. However, when we change the compiler flags to

set(CMAKE_CXX_FLAGS "-Wall")

(or to the case where we use only AVX, without FMA), then we get the same general solution from the optimizer, but with very different convergence behavior. Specifically, without vectorization, the optimizer takes about 500 iterations to converge to the solution.

So in summary:

With AVX and FMA      : 100 iterations to converge
With AVX              : 200 iterations to converge
Without AVX and FMA   : 500 iterations to converge

We are literally only changing that one line in the cmake file, not the source code.

I would like some suggestions for why this may be occurring.


My thoughts and more background info:

It seems to me that the versions with and without vectorization must be rounding differently, and that is making IPOPT converge differently. I was under the impression that adding the AVX and FMA flags would not change the output of the functions, but only the time it takes to compute them. I appear to be wrong.

The phenomenon we are observing appears particularly strange to me because, on one hand, the optimizer always converges to the same solution, which suggests that the problem can't be too ill-conditioned. On the other hand, the fact that the optimizer behaves differently with and without vectorization flags suggests that the problem IS sensitive to whatever small rounding differences are introduced by vectorized instructions.

One other thing to keep in mind is that we precompiled IPOPT into a library, and are simply linking our code against that precompiled library. So I don't think that the AVX and FMA flags can be affecting the optimizer itself. That seems to mean that our functions must be outputting tangibly different values depending on whether vectorization is enabled.
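One way we could check this is to dump the callback outputs at full precision from both builds and diff the traces. Here is a rough sketch of what I mean (not our actual code; log_objective and the file name are just placeholders):

#include <cstdio>

// Rough sketch: log every objective value at full precision so the traces
// from the AVX/FMA build and the plain build can be diffed directly.
// "log_objective" and the file name are placeholders, not our real code.
static void log_objective(double obj_value)
{
    static std::FILE* trace = std::fopen("objective_trace.txt", "w");
    // 17 significant digits round-trip a double exactly; %a shows the exact bits.
    std::fprintf(trace, "%.17g  (%a)\n", obj_value, obj_value);
}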


For those interested, here is the full CMake file:

cmake_minimum_required(VERSION 3.5)

# If a build type is not passed to cmake, then use this...
if(NOT CMAKE_BUILD_TYPE)
    # set(CMAKE_BUILD_TYPE Release)
    set(CMAKE_BUILD_TYPE Debug)
endif()

# If you are debugging, generate symbols.
set(CMAKE_CXX_FLAGS_DEBUG "-g")

# If in release mode, use all possible optimizations
set(CMAKE_CXX_FLAGS_RELEASE "-O3")

# We need c++11
set(CMAKE_CXX_STANDARD 11)

# Show us all of the warnings and enable all vectorization options!!!
# I must be crazy because these vectorization flags seem to have no effect.
set(CMAKE_CXX_FLAGS "-Wall -mavx -mfma")

if (CMAKE_SYSTEM_NAME MATCHES "CYGWIN")
    include_directories(../../Eigen/
            /cygdrive/c/coin/windows/ipopt/include/coin/
            /cygdrive/c/coin/windows/ipopt/include/coin/ThirdParty/)
    find_library(IPOPT_LIBRARY ipopt HINTS /cygdrive/c/coin/windows/ipopt/lib/)
else ()
    include_directories(../../Eigen/
            ../../coin/CoinIpopt/build/include/coin/
            ../../coin/CoinIpopt/build/include/coin/ThirdParty/)
    find_library(IPOPT_LIBRARY ipopt HINTS ../../coin/CoinIpopt/build/lib/)
endif ()

# Build the c++ functions into an executable
add_executable(trajectory_optimization main.cpp)

# Link all of the libraries together so that the C++-executable can call IPOPT
target_link_libraries(trajectory_optimization ${IPOPT_LIBRARY})
bremen_matt
  • Do your other options include `-ffast-math`? That would make the difference between AVX and no-AVX more easily understandable. – Peter Cordes Mar 23 '18 at 10:28
  • No. I realize that fast math will round, but I do not explicitly enable it – bremen_matt Mar 23 '18 at 10:29
  • FP math always rounds. But `-ffast-math` would allow the compiler to generate asm that rounds *differently* than the C source, i.e. change the order of operations so you have different temporaries. Without it, there is still some (but not much, especially without x87 80-bit FP registers) scope for the compiler to do anything that will give different numerical results. – Peter Cordes Mar 23 '18 at 10:32
  • BTW, you should use `-march=native` for best performance, not just `-mavx -mfma`. You want gcc to *tune* specifically for your CPU, as well as just enabling some instruction-sets that it supports but still tuning for generic CPUs. You are using `-O3`, right? – Peter Cordes Mar 23 '18 at 10:32
  • Correct. We use `-O3` for the release build. I use `-g` for debug, and to be able to profile with Callgrind. However, the results above are in release mode – bremen_matt Mar 23 '18 at 10:34
  • `-g` doesn't affect code-gen at all, so you're fine there. – Peter Cordes Mar 23 '18 at 10:35
  • re: ARM *because the performance is drastically different than on an Intel processor*. That's normal. Mainstream Intel CPUs have massively powerful SIMD FPU hardware, and large / fast caches. e.g. L1d cache can sustain 2 loads + 1 store per clock cycle, of 256 bit vectors, and that's a big deal for matmul. You could see if you get any worse performance from compiling for AArch64 with `-fno-tree-vectorize`, but if all the time is spent inside a library that won't matter. It's possible your library has hand-tuned x86 code but just generic AArch64, though. – Peter Cordes Mar 23 '18 at 10:38
  • That's a good point. I did not think that perhaps the library is somehow tuned for x86 – bremen_matt Mar 23 '18 at 10:39
  • If you disassemble your AArch64 binaries, you can check whether they're using SIMD instructions on 128-bit `q` registers like `q0` through `q31`, or whether they're using `s` (32-bit single) or `d` (64-bit double) registers. On AArch64, s3 is the low 32 bits of q3, unlike ARM32 where it's the high 32 bits of q0, in case you're wondering about how partial registers work on ARM. If the AArch64 code is using SIMD but getting much less than the expected throughput for the hardware's SIMD FPU, maybe the source used SIMD intrinsics. gcc/clang are terrible with SIMD intrinsics for ARM, unlike x86. – Peter Cordes Mar 23 '18 at 10:53

1 Answer


Enabling FMA will result in different rounding behavior, which can lead to very different results if your algorithm is not numerically stable. Also, enabling AVX in Eigen will result in a different order of additions, and since floating point math is non-associative, this can also lead to slightly different behavior.
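As a small standalone illustration of the FMA part (made-up numbers, nothing to do with your code): a fused multiply-add rounds a*b + c only once, while a separate multiply and add rounds twice, so the same source expression can produce a different result depending on whether the compiler is allowed to contract it:

#include <cmath>
#include <cstdio>

int main()
{
    // Chosen so the exact product a*b = 1 - 2^-60 is not representable as a
    // double: a separate multiply rounds it to 1.0, so a*b + c becomes 0,
    // while a single fused multiply-add keeps it and returns -2^-60.
    volatile double a = 1.0 + std::ldexp(1.0, -30);   // 1 + 2^-30
    volatile double b = 1.0 - std::ldexp(1.0, -30);   // 1 - 2^-30
    volatile double c = -1.0;

    double expr  = a * b + c;          // with -mfma the compiler may contract
                                       // this into one FMA (different result)
    double fused = std::fma(a, b, c);  // always a single rounding: -2^-60

    std::printf("a*b + c        : %.17g\nstd::fma(a,b,c): %.17g\n", expr, fused);
}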

To illustrate why non-associativity can make a difference: when adding 8 consecutive doubles a[8] with SSE3 or with AVX, Eigen will typically produce code equivalent to the following:

// SSE:
double t[2]={a[0], a[1]};
for(int i=2; i<8; i+=2)
   t[0]+=a[i], t[1]+=a[i+1]; // addpd
t[0]+=t[1];                  // haddpd

// AVX:
double t[4]={a[0],a[1],a[2],a[3]};
for(int j=0; j<4; ++j) t[j]+=a[4+j]; // vaddpd
t[0]+=t[2]; t[1]+=t[3];          // vhaddpd
t[0]+=t[1];                      // vhaddpd
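If you want to see the effect of these different summation orders concretely, here is a self-contained sketch (with made-up values chosen so the orders visibly disagree; this is only an illustration, not Eigen's actual code path):

#include <cstdio>

int main()
{
    // Made-up values spanning many orders of magnitude so the order matters.
    double a[8] = {1e16, 1.0, -1e16, 1.0, 1e-8, 2.0, -2.0, 3.0};

    // Scalar: strict left-to-right, as written in the C++ source.
    double scalar = 0.0;
    for (int i = 0; i < 8; ++i) scalar += a[i];

    // SSE-style: two interleaved partial sums, combined at the end.
    double t2[2] = {a[0], a[1]};
    for (int i = 2; i < 8; i += 2) { t2[0] += a[i]; t2[1] += a[i+1]; }
    double sse = t2[0] + t2[1];

    // AVX-style: four partial sums, then a tree reduction.
    double t4[4] = {a[0], a[1], a[2], a[3]};
    for (int j = 0; j < 4; ++j) t4[j] += a[4 + j];
    double avx = (t4[0] + t4[2]) + (t4[1] + t4[3]);

    // On this input the three orders give roughly 4.00000001, 5.00000001 and 5.
    std::printf("scalar %.17g\nsse    %.17g\navx    %.17g\n", scalar, sse, avx);
}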

Without more details it is hard to tell what exactly happens in your case.

chtz
  • This is a large optimization problem. It could be that there is some ringing occurring in the solution. Other than that, this problem just boils down to a few matrix multiplications. – bremen_matt Mar 23 '18 at 10:32
  • Without `-ffast-math`, the compiler has very limited scope for changing numerical results, other than FP contraction into FMA, and what it evals at compile time. Hardware FP add and mul operations [*are* commutative except for the payload of NaN](https://stackoverflow.com/questions/5007400/is-multiplication-always-commutative-in-inexact-floating-point-arithmetic). The C++ `+` operator [is not](https://stackoverflow.com/questions/24442725/is-floating-point-addition-commutative-in-c), but it's not at all obvious why AVX vs. SSE2 would matter. – Peter Cordes Mar 23 '18 at 10:49
  • FP math is not *associative*, but compilers aren't allowed to pretend that it is unless you use `-ffast-math`. – Peter Cordes Mar 23 '18 at 10:49
  • @PeterCordes Eigen will use different compilation paths depending on whether AVX/FMA/... is available or not. (Non-commutative was a typo, I meant associative, of course) – chtz Mar 23 '18 at 12:44
  • Hmm. Well that is strange. I definitely am not enabling fast-math. I updated my question with the full cmake file. – bremen_matt Mar 23 '18 at 13:45
  • Oh, I missed that the OP was using Eigen. I thought they said all the heavy lifting was in *pre-compiled* libraries, and thus there wouldn't be any different vectorization depending on how the program was compiled. Manual vectorization with preprocessor stuff would totally explain it! (But hopefully compilers are smart enough not to use `haddpd` there, because it's not optimal for performance, only code-size. https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86). This is now a *much* better answer, nice edit. – Peter Cordes Mar 23 '18 at 13:48
  • @bremen_matt: It's not strange. Eigen uses different source code depending on instruction set available, so the behaviour that the compiler has to preserve (without `-ffast-math`) is different between those cases. Using SIMD for reductions requires pretending that FP math is associative, but usually the interleaved order of operations you get from SIMD is no worse or better than accumulating each element in order. – Peter Cordes Mar 23 '18 at 13:53
  • @bremen_matt: related: [Associativity gives us parallelizability. But what does commutativity give?](https://stackoverflow.com/questions/35443424/associativity-gives-us-parallelizability-but-what-does-commutativity-give) has a nice link in the question with some background about how associativity gives parallelism. – Peter Cordes Mar 23 '18 at 13:55
  • @PeterCordes It seems Eigen indeed avoids `hadd` instructions (at least when reducing one packet to a single value: https://bitbucket.org/eigen/eigen/src/9e00a4b6ce/Eigen/src/Core/arch/SSE/PacketMath.h#PacketMath.h-586) – chtz Mar 23 '18 at 15:08
  • Yeah, `hadd` is useful when you needed 2 shuffles and a vertical add anyway, e.g. as part of a transpose-and-add sort of thing. Or any case where you can use it with 2 *different* input vectors. Or possibly worth it with `haddpd(same,same)` if you want the result broadcast to both elements of a PD vector. – Peter Cordes Mar 24 '18 at 01:54