
Why is it mandatory to use -ffast-math with g++ to achieve the vectorization of loops using doubles? I don't like -ffast-math because I don't want to lose precision.

Ruggero Turra
  • `-ffast-math` is actually a combination flag which enables a set of other flags that can be set individually instead – perhaps you might be able to get away with only setting one or two of the individual flags? – Amber May 17 '10 at 20:59
  • I tried, but only with `-ffast-math` do I get the maximum number of vectorized loops – Ruggero Turra May 17 '10 at 21:03
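For reference, the GCC manual documents -ffast-math as a shorthand that turns on a set of individual flags (the exact set varies by GCC version), roughly:

-fno-math-errno -funsafe-math-optimizations -ffinite-math-only
-fno-rounding-math -fno-signaling-nans -fcx-limited-range

where -funsafe-math-optimizations in turn enables -fassociative-math, -freciprocal-math, -fno-signed-zeros and -fno-trapping-math.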

3 Answers


You don’t necessarily lose precision with -ffast-math. It mainly affects the handling of NaN, Inf, etc., and the order in which operations are performed.

If you have a specific piece of code where you do not want GCC to reorder or simplify computations, you can mark a variable as being used by an empty asm statement.

For instance, the following code performs a rounding operation on f. However, the two operations f += g and f -= g are likely to get optimised away by gcc when -ffast-math is enabled:

static double moo(double f, double g)
{
    g *= 4503599627370496.0; // 2 ** 52
    f += g;                  // adding the (large) scaled g rounds away f's low bits,
    f -= g;                  // and subtracting it back leaves the rounded result
    return f;
}

On x86_64, you can use this asm statement to instruct GCC not to perform that optimisation:

static double moo(double f, double g)
{
    g *= 4503599627370496.0; // 2 ** 52
    f += g;
    __asm__("" : "+x" (f)); // f may be "modified" here, so the pair cannot be folded
    f -= g;
    return f;
}

You will need to adapt this for each architecture, unfortunately. On PowerPC, use +f instead of +x.
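If you need this on more than one target, one approach (a sketch of my own; the macro name and the memory-constraint fallback are assumptions, not part of the original answer) is to hide the constraint behind a macro:

/* Hypothetical portability wrapper -- not from the original answer. */
#if defined(__x86_64__)
#  define OPAQUE_DOUBLE(x) __asm__("" : "+x" (x)) /* SSE register */
#elif defined(__powerpc__)
#  define OPAQUE_DOUBLE(x) __asm__("" : "+f" (x)) /* FP register */
#else
#  define OPAQUE_DOUBLE(x) __asm__("" : "+m" (x)) /* memory fallback, slower */
#endif

static double moo(double f, double g)
{
    g *= 4503599627370496.0; // 2 ** 52
    f += g;
    OPAQUE_DOUBLE(f); // compiler must assume f changed here
    f -= g;
    return f;
}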

sam hocevar
  • Simply making `f` a `volatile double` should have the same effect. – Benjamin Buch Oct 05 '17 at 19:20
  • Not quite so, unfortunately. There is a side effect to using `volatile`, which is that the variable is always written to memory. In this example, all the operations could be performed in registers, but with `volatile` the compiler will emit several additional read/write opcodes. So even if the result is the same, the code will be a lot slower. – sam hocevar Oct 06 '17 at 17:08
  • Thanks for the explanation Sam! – Benjamin Buch Oct 09 '17 at 12:06
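For completeness, the volatile variant discussed in these comments would look something like this (a minimal sketch; note the extra memory traffic Sam describes):

static double moo(volatile double f, double g)
{
    g *= 4503599627370496.0; // 2 ** 52
    f += g;   // every access to f is now a memory access,
    f -= g;   // so the compiler cannot fold the pair away
    return f;
}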

Most likely because vectorization means that you may get different results, or may miss floating-point signals/exceptions.

If you're compiling for 32-bit x86, gcc and g++ default to using the x87 for floating-point math; on 64-bit they default to SSE. However, the x87 can and will produce different values for the same computation, so it's unlikely g++ will consider vectorizing if it can't guarantee that you will get the same results, unless you use -ffast-math or some of the flags it turns on.

Basically, it comes down to the fact that the floating-point environment for vectorized code may not be the same as the one for non-vectorized code, sometimes in ways that are important. If the differences don't matter to you, try something like

-fno-math-errno -fno-trapping-math -fno-signaling-nans -fno-rounding-math

but first look up those options and make sure that they won't affect your program's correctness. -ffinite-math-only may also help.
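As a concrete illustration (my example, not from the answer): a loop calling sqrt usually won't vectorize on its own, because sqrt may set errno, and -fno-math-errno removes exactly that obstacle:

#include <cmath>

// Without -fno-math-errno, GCC must assume std::sqrt can set errno,
// which blocks vectorization of this loop; with it (for example
// g++ -O3 -fno-math-errno), the loop can compile down to SIMD sqrt
// instructions such as sqrtpd.
void roots(const double *a, double *b, int n)
{
    for (int i = 0; i < n; ++i)
        b[i] = std::sqrt(a[i]);
}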

Spudd86
  • I think gcc actually isn't too careful about x87 vs. SSE rounding-mode and other settings (like denormal FTZ and DAZ). Related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678: changing the rounding mode between FP operations requires `#pragma STDC FENV_ACCESS ON`, and even then gcc doesn't fully support that. But I think gcc generally doesn't assume that the mode is round-to-nearest when optimizing code that doesn't itself change the rounding mode. – Peter Cordes Aug 26 '17 at 18:41
  • @PeterCordes on x87 there's no way to do certain things with exactly rounded float/double, only with the 80-bit extended x87 type. I don't recall what those are; I learned this from an article years ago. gcc DOES avoid those kinds of things. Hell, it used to not constant-fold floating point at all, because someone somewhere might be cross-compiling to something weird and the results wouldn't be the same; these days it has an internal library that can simulate weird floating-point platforms, so it will constant-fold as long as the result won't differ. – Spudd86 Aug 28 '17 at 06:46
  • No *good* way, anyway. There's `gcc -ffloat-store` which stores/reloads to round to `float` or `double`, but of course that costs massive amounts of performance. Also, x87 has a precision-control bit that you can set to limit mantissa precision to 53 or 24 (for faster div/sqrt), instead of 64. See https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/: apparently MSVC used to actually set it to reduced precision! – Peter Cordes Aug 28 '17 at 07:07
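To make the rounding-mode discussion in these comments concrete, here is a minimal sketch using the standard <cfenv> interface (my example; the volatile qualifiers are there to stop constant folding, and building with -frounding-math is advisable since, as noted above, GCC's FENV_ACCESS support is incomplete):

#include <cfenv>
#include <cstdio>

// Build with e.g. g++ -O2 -frounding-math fenv_demo.cpp
int main()
{
    volatile double one = 1.0, three = 3.0;

    std::fesetround(FE_UPWARD);
    double up = one / three;    // 1/3 rounded towards +infinity
    std::fesetround(FE_DOWNWARD);
    double down = one / three;  // 1/3 rounded towards -infinity

    std::printf("%.20f\n%.20f\n", up, down); // adjacent doubles
    return 0;
}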

Because -ffast-math enables operand reordering, which allows much code to be vectorized.

For example, to calculate this

sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + … a[99]

the compiler is required to do the additions sequentially without -ffast-math, because floating-point math is not associative.

That's the same reason why compilers can't optimize a*a*a*a*a*a to (a*a*a)*(a*a*a) without -ffast-math.
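A quick way to see the non-associativity for yourself (my example, not from the answer):

#include <cstdio>

int main()
{
    // Same three numbers, different grouping, different doubles:
    std::printf("%.17g\n", (0.1 + 0.2) + 0.3); // 0.60000000000000009
    std::printf("%.17g\n", 0.1 + (0.2 + 0.3)); // 0.59999999999999998
    return 0;
}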

That means no vectorization is available, unless the target has very efficient horizontal vector adds.

However, if -ffast-math is enabled, the expression can be calculated like this (see A7. Auto-Vectorization):

sum0 = a[0] + a[4] + a[ 8] + … a[96]
sum1 = a[1] + a[5] + a[ 9] + … a[97]
sum2 = a[2] + a[6] + a[10] + … a[98]
sum3 = a[3] + a[7] + a[11] + … a[99]
sum’ = sum0 + sum1 + sum2 + sum3

Now the compiler can vectorize it easily by adding each column in parallel and then do a horizontal add at the end

Does sum’ == sum? Only if (a[0]+a[4]+…) + (a[1]+a[5]+…) + (a[2]+a[6]+…) + (a[3]+a[7]+…) == a[0] + a[1] + a[2] + … That holds under associativity, which floats don’t adhere to all of the time. Specifying /fp:fast lets the compiler transform your code to run faster – up to 4 times faster, for this simple calculation.

Do You Prefer Fast or Precise? - A7. Auto-Vectorization

In gcc, this reordering can be enabled on its own with the -fassociative-math flag.
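As a sketch (my example; the flag behaviour is per the GCC manual), the plain reduction loop below is the kind of code this unlocks. Note that -fassociative-math only takes effect together with -fno-signed-zeros and -fno-trapping-math:

// Build with e.g.:
//   g++ -O3 -fassociative-math -fno-signed-zeros -fno-trapping-math sum.cpp
// gcc is then free to keep several partial sums in a vector register,
// just like the sum0..sum3 decomposition above.
double sum(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}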

Further reading

phuclv