1584

Why does this bit of code,

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0.1f; // <--
        y[i] = y[i] - 0.1f; // <--
    }
}

run more than 10 times faster than the following bit (identical except where noted)?

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0; // <--
        y[i] = y[i] - 0; // <--
    }
}

when compiling with Visual Studio 2010 SP1. The optimization level was -O2 with SSE2 enabled. I haven't tested with other compilers.

Jonas Stein
GlassFish
  • 1
    Make sure you're building a release build, not debug. – tenfour Feb 16 '12 at 16:14
  • 10
    How did you measure the difference? And what options did you use when you compiled? – James Kanze Feb 16 '12 at 16:19
  • 165
    Why isn't the compiler just dropping the +/- 0 in this case?!? – Michael Dorgan Feb 16 '12 at 16:25
  • 4
    This is closely related to the issue in http://stackoverflow.com/questions/5180150/floating-multiplication-performing-slower-depending-of-operands-in-c – Stephen Canon Feb 16 '12 at 19:46
  • 128
    @Zyx2000 The compiler isn't anywhere near that stupid. Disassembling a trivial example in LINQPad shows that it spits out the same code whether you use `0`, `0f`, `0d`, or even `(int)0` in a context where a `double` is needed. – millimoose Feb 17 '12 at 02:20
  • See the diff here http://diffchecker.com/Rmf9561 – Hamid Nazari Feb 17 '12 at 03:20
  • 15
    what is the optimization level? – Otto Allmendinger Feb 17 '12 at 08:02
  • 2
    @OttoAllmendinger the optimization level I used was O2 with SSE2 enabled – GlassFish Feb 17 '12 at 09:03
  • 1
    @Dragarro - Just curious - how did you end up writing and timing this code? – Vic Feb 17 '12 at 10:31
  • @Vic I was just fooling around when I noticed it, just timing different functions and checking different compiler optimization flags. At first the code only had the multiplication and division part, and I thought of adding addition and subtraction just to see how much extra time it would take, and to my surprise the code had sped up by a factor of 10. I checked to see if I wrote something odd, and then I ended up posting it here to see if maybe I missed something. – GlassFish Feb 17 '12 at 13:20
  • @HamidNazari - An online diff pastebin is a neat idea, but this one seems to be spitting back "can't open file". It appears to default to "Don't store diff", I selected "store forever" and it gave me this link: http://diffchecker.com/U6w74qj – Kevin Vermeer Mar 11 '12 at 18:04
  • See http://stackoverflow.com/questions/5180150/floating-multiplication-performing-slower-depending-of-operands-in-c – Anujith Feb 15 '13 at 08:20
  • 15
    [Why, really, isn't the compiler dropping the +/-0?](http://stackoverflow.com/questions/16477037/why-does-msvs-not-optimize-away-0-insted-it-turns-it-into-a-denormalized-float) – Vorac May 10 '13 at 07:12

5 Answers

1656

Welcome to the world of denormalized floating-point numbers! They can wreak havoc on performance!

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.
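
(As a quick aside, a minimal sketch of how to check for subnormal values with std::fpclassify from <cmath>; the halving loop is just an arbitrary way to produce them.)

// Minimal sketch: repeatedly halving a float eventually drops it below FLT_MIN,
// the smallest *normalized* float; from there until it underflows to zero,
// fpclassify reports FP_SUBNORMAL.
#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    float v = 1.0f;
    while (v != 0.0f) {
        if (std::fpclassify(v) == FP_SUBNORMAL)
            std::printf("%g is subnormal (FLT_MIN is %g)\n", v, FLT_MIN);
        v /= 2.0f;
    }
    return 0;
}

With flush-to-zero enabled (see below), this prints nothing: the value jumps straight to zero instead of passing through the subnormal range.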

If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.

Here's the test code compiled on x64:

// Requires OpenMP for omp_get_wtime(); compile with /openmp (MSVC) or -fopenmp (gcc).
#include <cstdlib>
#include <iostream>
#include <omp.h>
using namespace std;

int main() {

    double start = omp_get_wtime();

    const float x[16]={1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16]={1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for(int i=0;i<16;i++)
    {
        y[i]=x[i];
    }
    for(int j=0;j<9000000;j++)
    {
        for(int i=0;i<16;i++)
        {
            y[i]*=x[i];
            y[i]/=z[i];
#ifdef FLOATING
            y[i]=y[i]+0.1f;
            y[i]=y[i]-0.1f;
#else
            y[i]=y[i]+0;
            y[i]=y[i]-0;
#endif

            if (j > 10000)
                cout << y[i] << "  ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}

Output:

#define FLOATING
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007

//#define FLOATING
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.46842e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.45208e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044

Note how in the second run the numbers are very close to zero.

Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.


To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)

This means that rather than using these weird lower precision almost-zero values, we just round to zero instead.
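
(A minimal sketch of that rounding behavior, assuming the compiler emits SSE math, e.g. an x64 build or /arch:SSE2, since the x87 unit ignores MXCSR:)

#include <cstdio>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON

int main() {
    volatile float a = 1e-30f, b = 1e-10f;   // the product, ~1e-40, is subnormal

    float gradual = a * b;                   // subnormal result (the slow path on many CPUs)
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    float flushed = a * b;                   // same multiply, but the result is flushed to +0

    std::printf("without FTZ: %g\nwith FTZ:    %g\n", gradual, flushed);
    return 0;
}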

Timings: Core i7 920 @ 3.5 GHz:

//  Don't flush denormals to zero.
0.1f: 0.564067
0   : 26.7669

//  Flush denormals to zero.
0.1f: 0.587117
0   : 0.341406

In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.

johv
Mysticial
  • Only difference in the generated code is `fld qword ptr [__real@3fb99999a0000000 (11820E8h)]` for fast version and `fldz` at the start of the loop by the way, at least for my build in VS2010 – Dervall Feb 16 '12 at 16:27
  • 4
    @Dervall That's correct. There's almost no difference in code. It's the value that affects whether or not the numbers go denormal. – Mysticial Feb 16 '12 at 16:30
  • 5
    It's particularly interesting, because the sequence which changes things is fundamentally `x += 0.1; x -= 0.1`, which can also be written `(x + 0.1) - 0.1`. One notes the lack of associativity. (Of course, rewriting like this could change the results, due to the fact that C++ allows keeping intermediate results in extended precision.) – James Kanze Feb 16 '12 at 17:58
  • 105
    I'm still finding it a little weird that the "+ 0" isn't completely optimized out by the compiler by default. Would this have happened if he had put "+ 0.0f"? – s73v3r Feb 16 '12 at 19:10
  • 52
    @s73v3r That's a very good question. Now that I look at the assembly, not even `+ 0.0f` gets optimized out. If I had to guess, it could be that `+ 0.0f` would have side-effects if `y[i]` happened to be a signalling `NaN` or something... I could be wrong though. – Mysticial Feb 16 '12 at 19:31
  • How is the performance if one takes the addition/subtraction out? I wouldn't be surprised if it is still expensive, because the multiplication/division is more expensive for subnormals. – CodesInChaos Feb 16 '12 at 20:05
  • 3
    @CodeInChaos Yes, it's still expensive, but only `13.3208` seconds as opposed to `26.7669`. The numbers still go denormal. But I guess since there's only half as many operations, it's twice as fast. It's somewhat surprising to me that a denormal division seems to run only just as slow as a denormal addition/subtraction. – Mysticial Feb 16 '12 at 20:21
  • 1
    So..... what should the performance-conscious programmer do when there are zero-valued floats involved? – Chriszuma Feb 16 '12 at 21:19
  • 1
    @Chriszuma, it's not the zero values here that are the problem, it's the near-zero denormals. If you don't need to do arithmetic on super tiny numbers, then flush denormals to zero. If you do need to handle numbers that small, then it's probably faster to use doubles on most platforms. – user57368 Feb 16 '12 at 21:44
  • 14
    Doubles will still run into the same problem in many cases, just at a different numerical magnitude. Flush-to-zero is fine for audio applications (and others where you can afford to lose 1e-38 here and there), but I believe doesn't apply to x87. Without FTZ, the usual fix for audio applications is to inject a very low amplitude (not audible) DC or square wave signal to jitter numbers away from denormality. – Russell Borogove Feb 17 '12 at 00:12
  • @RussellBorogove: I wonder if any hardware designs include a mode to force LSB truncation of values near zero, so as to avoid a requirement to handle denormalized values weirdly? – supercat Feb 17 '12 at 04:13
  • 2
    Around 1998/1999 I used to play with plugins for Jeskola Buzz. I upgraded from Pentium 166MMX to Pentium 3 500MHz for that. Part of the lore then was that as values decayed very close to zero, "underflow exceptions" would cause severe slowdown - even with the interrupt/trap/whatever disabled, it would still cause a big slowdown. The fix was to force near-zero values to zero. Was this lore wrong? Was this actually a denormalized floats issue? – Steve314 Feb 17 '12 at 07:39
  • 1
    Putting aside the performance, why `y[i] = (y[i] + 0.1f) - 0.1f` even leads to a different value than `y[i] = (y[i] + 0) - 0` ? – Isaac Feb 17 '12 at 11:09
  • 16
    @Isaac because when y[i] is significantly smaller than 0.1 adding it results in a loss of precision because the most significant digit in the number becomes higher. – Dan Is Fiddling By Firelight Feb 17 '12 at 13:28
  • 3
    This brings a curiosity question: do you have examples of hardware vendors doing efficient denormal computation? – nraynaud Feb 17 '12 at 14:10
  • @supercat, if I understand you rightly, that's essentially what flush-to-zero is. – Russell Borogove Feb 17 '12 at 18:27
  • 4
    @Steve314, the lore is, let's say, imprecise. IEEE754 defines an "underflow exception" for denormalized numbers; if the hardware isn't able to handle denorms accurately, it's supposed to raise an exception instead, but x87 does handle the denormals, slowly, without raising exception. – Russell Borogove Feb 17 '12 at 18:28
  • 6
    @nraynaud, Intel's new Sandy Bridge microarchitecture handles denormals remarkably efficiently -- according to Agner Fog, at full speed. http://www.agner.org/optimize/blog/read.php?i=142 – Russell Borogove Feb 17 '12 at 18:29
  • 2
    @RussellBorogove: In a flush-to-zero system, as I understand the term, the difference between the smallest and second-smallest positive numbers is a tiny fraction of the magnitude of the smallest number. Adding denormals to the system reduces the magnitude of the smallest number to match the smallest measurable difference between numbers. What I'd suggest would be increasing the minimum difference between numbers to match the smallest normalized number. This would ensure that if x!=y the difference between x+(y-x) and y would be of smaller magnitude than the difference between x and y. – supercat Feb 17 '12 at 19:27
  • @RussellBorogove: Whereas handling of denormals requires adding complexity to all aspects of handling floats, limiting the precision of lower bits (should actually be done via rounding rather than truncation) would only require extra logic in one place. Instead of always rounding to one mantissa-LSB, one would sometimes round off to a coarser level. – supercat Feb 17 '12 at 19:34
  • Does this have any relevance for Obj-C. I'd guess the answer would be, "probably, benchmark it yourself?" Or could this not apply in Objective-C for some reason...? – Dan Rosenstark Feb 18 '12 at 07:18
  • 6
    @Yar: it is more of a CPU than a language issue, so it probably has relevance for Obj-C on x86. (iPhone's armv7 doesn't seem to support denormalized floats, at least with the default runtime/build settings) – mvds Feb 18 '12 at 11:56
  • 4
    @RussellBorogove I've got a Sandy bridge CPU here (i7-2620M), and I can't observe that efficiency - I see a 22x slowdown with gcc 4.6.1 and `g++ -O3 -march=corei7-avx 9314534.cpp`. Similar results with all compiler options I've tried. – je4d Feb 23 '12 at 21:30
  • 1
    @je4d I noticed it on my 2600K as well. And it does go contrary to what Agner Fog says. I initially suspected that the perhaps the division was still sensitive to denormals. But removing the division (and multiplying by reciprocals of `z[i]` instead) yields a slowdown of 96x to denormals!!! – Mysticial Feb 23 '12 at 21:47
  • @Mysticial all my google search results for snb and denormalized FP lead back to that one Agner Fog article. I've also been trying various compiler options (e.g. -mfpmath=..., -mfast-math) since my last comment and I can't find any where 1) -DFLOATING doesn't make it faster and 2) the results without -DFLOATING are correct. So AFAICS, the claim is just bogus. – je4d Feb 23 '12 at 22:03
  • Interesting. Are you seeing this on x87, SSE, or both? – Russell Borogove Feb 23 '12 at 22:28
  • @RussellBorogove I tested x87 in 32-bit mode and SSE in 64-bit mode. Same results. Huge slowdowns to denormals both with and without the division. ~25x with the division and ~100x without the division. – Mysticial Feb 23 '12 at 22:34
  • No clue what's up then. You might contact Fog if you want to sort it out. – Russell Borogove Feb 23 '12 at 22:39
  • @RussellBorogove I'm doing 64bit only, and I see a slowdown for both 387 (50x) and sse (14.5x) with doubles - similar numbers for floats. – je4d Feb 23 '12 at 23:10
  • @mvds Just to confirm this is not a single compiler thing, I can see the same slowdown in Delphi. – EMBarbosa Mar 12 '12 at 13:21
  • 176
    @s73v3r: The +0.f cannot be optimized out because floating-point has a negative 0, and the result of adding +0.f to -.0f is +0.f. So adding 0.f is not an identity operation and cannot be optimized out. – Eric Postpischil Jul 06 '12 at 17:59
  • Fog's claim that denormals do not increase latency since Sandy Bridge seems pretty wrong. Maybe since Skylake it is right, but certainly not on Haswell. – hdl Dec 17 '15 at 16:28
  • 3
    @hdl Fog initially claimed that Sandy Bridge didn't have denormal slowdowns. But after this question got popular, I believe someone notified him and he retested. The verdict was, no slowdown for additions/subtractions. But still a huge slowdown for multiplications. – Mysticial Dec 17 '15 at 20:34
  • @Mysticial Well he *still* claims "Denormal numbers, NAN's and infinity do not increase the latency." in the December 2014 PDF of "Instruction tables" for Sandy Bridge, Ivy Bridge and Haswell. – hdl Dec 18 '15 at 10:49
  • @Eric Postpischil The optimization is not even the issue. The problem arises even when you remove the addition completely. It is due to the division that the numbers get denormalized. – user32434999 Jul 18 '18 at 13:02
  • @Pi: The comment to which you are responding does not address the original problem posted here, in which subnormal numbers caused performance penalties. It is about an issue raised by s73v3r regarding why adding 0 is not removed during optimization. – Eric Postpischil Jul 18 '18 at 20:09
420

Using gcc and applying a diff to the generated assembly yields only this difference:

73c68,69
<   movss   LCPI1_0(%rip), %xmm1
---
>   movabsq $0, %rcx
>   cvtsi2ssq   %rcx, %xmm1
81d76
<   subss   %xmm1, %xmm0

The cvtsi2ssq version is indeed about 10 times slower.

Apparently, the float version uses an XMM register loaded from memory, while the int version converts an actual integer 0 to float using the cvtsi2ssq instruction, which takes a lot of time. Passing -O3 to gcc doesn't help. (gcc version 4.2.1.)

(Using double instead of float doesn't matter, except that it changes the cvtsi2ssq into a cvtsi2sdq.)

Update

Some extra tests show that it is not necessarily the cvtsi2ssq instruction. Once it is eliminated (using `int ai=0; float a=ai;` and using `a` instead of `0`), the speed difference remains. So @Mysticial is right: the denormalized floats make the difference. This can be seen by testing values between 0 and 0.1f. The turning point in the above code is approximately at 0.00000000000000000000000000000001, where the loop suddenly takes 10 times as long.

Update<<1

A small visualisation of this interesting phenomenon:

  • Column 1: a float, divided by 2 for every iteration
  • Column 2: the binary representation of this float
  • Column 3: the time taken to sum this float 1e7 times

You can clearly see the exponent (the last 9 bits) change to its lowest value when denormalization sets in. At that point, simple addition becomes 20 times slower. (A minimal sketch that reproduces this kind of measurement follows the table.)

0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 45 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 43 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 43 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 42 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 48 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 43 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 42 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 42 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 42 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 43 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 42 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 44 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 42 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 42 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 789 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 788 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 788 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 795 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 896 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 813 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 798 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 791 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 802 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 809 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 795 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 835 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 864 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 915 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 918 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 881 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 857 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 861 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 855 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 887 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 799 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 828 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 815 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 44 ms
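
(A rough sketch of how such a table can be produced; it is not necessarily the program used above, it prints the bits most-significant-first, and the timings and the exact threshold will depend on the CPU and compiler flags.)

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    volatile float sum;                      // volatile keeps the loop from being optimized away
    float v = 1e-34f;                        // arbitrary small normalized starting value

    for (int step = 0; step < 40 && v != 0.0f; ++step) {
        std::uint32_t bits;
        std::memcpy(&bits, &v, sizeof bits); // raw IEEE-754 bit pattern of v

        char bin[33];
        for (int b = 0; b < 32; ++b)         // most-significant bit first
            bin[b] = ((bits >> (31 - b)) & 1u) ? '1' : '0';
        bin[32] = '\0';

        auto t0 = std::chrono::steady_clock::now();
        sum = 0.0f;
        for (int i = 0; i < 10000000; ++i)   // sum v ten million times
            sum = sum + v;
        auto t1 = std::chrono::steady_clock::now();

        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%.48f: %s %lld ms\n", v, bin, ms);

        v /= 2.0f;                           // halve it for the next row
    }
    return 0;
}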

An equivalent discussion about ARM can be found in Stack Overflow question Denormalized floating point in Objective-C?.

mvds
  • 28
    `-O`s don't fix it, but `-ffast-math` does. (I use that all the time, IMO the corner cases where it causes precision trouble shouldn't turn up in a properly designed program anyway.) – leftaroundabout Feb 17 '12 at 10:17
  • There is no conversion at any positive optimization level with gcc-4.6. – Jed Mar 11 '12 at 16:14
  • @leftaroundabout: compiling an executable (not library) with `-ffast-math` links some extra startup code that sets FTZ (flush to zero) and DAZ (denormal are zero) in the MXCSR, so the CPU never has to take a slow microcode assist for denormals. – Peter Cordes Jan 16 '19 at 10:23
36

It's due to the use of denormalized floating point. How do you get rid of both it and the performance penalty? Having scoured the Internet for ways of killing denormal numbers, it seems there is no "best" way to do this yet. I have found the following methods, which may work best in different environments:

  • Might not work in some GCC environments:

    // Requires #include <fenv.h>
    fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);
    
  • Might not work in some Visual Studio environments: 1

    // Requires #include <xmmintrin.h>
    _mm_setcsr( _mm_getcsr() | (1<<15) | (1<<6) );
    // Does both FTZ and DAZ bits. You can also use just hex value 0x8040 to do both.
    // You might also want to use the underflow mask (1<<11)
    
  • Appears to work in both GCC and Visual Studio (see the verification sketch after this list):

    // Requires #include <xmmintrin.h>
    // Requires #include <pmmintrin.h>
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    
  • The Intel compiler has options to disable denormals by default on modern Intel CPUs. More details here

  • Compiler switches. -ffast-math, -msse or -mfpmath=sse will disable denormals and make a few other things faster, but unfortunately also do lots of other approximations that might break your code. Test carefully! The equivalent of fast-math for the Visual Studio compiler is /fp:fast but I haven't been able to confirm whether this also disables denormals.1
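
(A minimal verification sketch, assuming an SSE-capable x86 target: it applies the third method and then reads MXCSR back with _mm_getcsr() to confirm that the FTZ bit (bit 15) and the DAZ bit (bit 6) are set. The DAZ macro lives in the SSE3 header <pmmintrin.h>.)

#include <cstdio>
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE, _mm_getcsr

int main() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    unsigned csr = _mm_getcsr();
    std::printf("MXCSR = 0x%04x  FTZ(bit 15) = %u  DAZ(bit 6) = %u\n",
                csr, (csr >> 15) & 1u, (csr >> 6) & 1u);
    return 0;
}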

fig
  • 2
    This sounds like a decent answer to a different but related question (How can I prevent numeric computations from producing denormal results?) It doesn't answer this question, though. – Ben Voigt Jun 21 '14 at 21:28
  • Windows X64 passes a setting of abrupt underflow when it launches .exe, while Windows 32-bit and linux do not. On linux, gcc -ffast-math should set abrupt underflow (but I think not on Windows). Intel compilers are supposed to initialize in main() so that these OS differences don't pass through, but I've been bitten, and need to set it explicitly in the program. Intel CPUs beginning with Sandy Bridge are supposed to handle subnormals arising in add/subtract (but not divide/multiply) efficiently, so there is a case for using gradual underflow. – tim18 Jun 11 '16 at 20:03
  • 1
    Microsoft /fp:fast (not a default) doesn't do any of the aggressive things inherent in gcc -ffast-math or ICL (default) /fp:fast. It's more like ICL /fp:source. So you must set /fp: (and, in some cases, underflow mode) explicitly if you wish to compare these compilers. – tim18 Jun 11 '16 at 20:09
19

In gcc you can enable FTZ and DAZ with this:

#include <xmmintrin.h>

#define FTZ 1
#define DAZ 1   

void enableFtzDaz()
{
    int mxcsr = _mm_getcsr ();

    if (FTZ) {
            mxcsr |= (1<<15) | (1<<11);
    }

    if (DAZ) {
            mxcsr |= (1<<6);
    }

    _mm_setcsr (mxcsr);
}

also use gcc switches: -msse -mfpmath=sse

(credits to Carl Hetherington [1])

[1] http://carlh.net/plugins/denormals.php
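
(A possible usage sketch, assuming SSE math is actually in use, which is what the -msse -mfpmath=sse switches above ensure: call enableFtzDaz() once at startup, and a subnormal input operand then behaves as zero.)

#include <cstdio>

void enableFtzDaz();   // the function defined in the snippet above

int main() {
    volatile float tiny = 1e-41f;              // subnormal: below FLT_MIN

    std::printf("before: %g\n", tiny + 0.0f);  // prints the tiny subnormal value
    enableFtzDaz();
    std::printf("after:  %g\n", tiny + 0.0f);  // DAZ treats the operand as zero
    return 0;
}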

German Garcia
  • Also see `fesetround()` from `fenv.h` (defined for C99) for another, more portable way of rounding (http://linux.die.net/man/3/fesetround) (but this [would affect all FP operations, not just subnormals](http://gcc.gnu.org/ml/gcc-help/2011-12/msg00163.html)) – German Garcia Oct 02 '12 at 13:52
  • Are you sure you need 1<<15 and 1<<11 for FTZ? I have only seen 1<<15 quoted elsewhere... – fig Feb 26 '14 at 11:45
  • @fig: 1<<11 is for the Underflow Mask. More info here: http://softpixel.com/~cwright/programming/simd/sse.php – German Garcia Feb 26 '14 at 16:29
  • @GermanGarcia this does not answer the OPs question; the question was "Why does this bit of code, runs 10 times faster than..." - you should either attempt to answer that before providing this workaround or provide this in a comment. –  Jul 24 '14 at 23:41
13

Dan Neely's comment ought to be expanded into an answer:

It is not the zero constant 0.0f that is denormalized or causes a slowdown; it is the values that approach zero in each iteration of the loop. As they come closer and closer to zero, they need more precision to represent, and they become denormalized. These are the y[i] values. (They approach zero because x[i]/z[i] is less than 1.0 for all i.)

The crucial difference between the slow and fast versions of the code is the statement y[i] = y[i] + 0.1f;. As soon as this line is executed in each iteration of the loop, the extra precision in the float is lost, and the denormalization required to represent that precision is no longer needed. Afterwards, floating point operations on y[i] remain fast because they aren't denormalized.

Why is the extra precision lost when you add 0.1f? Because floating point numbers only have so many significant digits. Say you have enough storage for three significant digits, then 0.00001 = 1e-5, and 0.00001 + 0.1 = 0.1, at least for this example float format, because it doesn't have room to store the least significant bit in 0.10001.

In short, y[i]=y[i]+0.1f; y[i]=y[i]-0.1f; isn't the no-op you might think it is.
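
(A tiny worked example of that point: for a value some 37 orders of magnitude below 0.1f, the addition rounds to exactly 0.1f, so the round trip returns exactly zero rather than the original value.)

#include <cstdio>

int main() {
    float y = 1e-38f;                        // near the bottom of the normal float range
    float roundTrip = (y + 0.1f) - 0.1f;     // the "no-op" from the question

    // y is far below one ulp of 0.1f (about 7.5e-9), so y + 0.1f rounds to
    // 0.1f exactly, and subtracting 0.1f leaves exactly 0.0f, not y.
    std::printf("y = %g, (y + 0.1f) - 0.1f = %g\n", y, roundTrip);
    return 0;
}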

Mysticial said this as well: the content of the floats matters, not just the assembly code.

EDIT: To put a finer point on this, not every floating point operation takes the same amount of time to run, even if the machine opcode is the same. For some operands/inputs, the same instruction will take more time to run. This is especially true for denormal numbers.

remcycles