
I recently read this post: Floating point vs integer calculations on modern hardware and was curious about my own processor's performance on this quasi-benchmark, so I put together two versions of the code, one in C# and one in C++ (Visual Studio 2010 Express), and compiled both with optimizations to see what falls out. The output from my C# version is fairly reasonable:

int add/sub: 350ms
int div/mul: 3469ms
float add/sub: 1007ms
float div/mul: 67493ms
double add/sub: 1914ms
double div/mul: 2766ms

When I compiled and ran the C++ version, something completely different shook out:

int add/sub: 210.653ms
int div/mul: 2946.58ms
float add/sub: 3022.58ms
float div/mul: 172931ms
double add/sub: 1007.63ms
double div/mul: 74171.9ms

I expected some performance differences, but not ones this large! I don't understand why division/multiplication in C++ is so much slower than addition/subtraction, whereas the managed C# version is more in line with my expectations. The code for the C++ version of the function is as follows:

template< typename T> void GenericTest(const char *typestring)
{
    T v = 0;
    T v0 = (T)((rand() % 256) / 16) + 1;
    T v1 = (T)((rand() % 256) / 16) + 1;
    T v2 = (T)((rand() % 256) / 16) + 1;
    T v3 = (T)((rand() % 256) / 16) + 1;
    T v4 = (T)((rand() % 256) / 16) + 1;
    T v5 = (T)((rand() % 256) / 16) + 1;
    T v6 = (T)((rand() % 256) / 16) + 1;
    T v7 = (T)((rand() % 256) / 16) + 1;
    T v8 = (T)((rand() % 256) / 16) + 1;
    T v9 = (T)((rand() % 256) / 16) + 1;

    HTimer tmr = HTimer();
    tmr.Start();
    for (int i = 0 ; i < 100000000 ; ++i)
    {
        v += v0;
        v -= v1;
        v += v2;
        v -= v3;
        v += v4;
        v -= v5;
        v += v6;
        v -= v7;
        v += v8;
        v -= v9;
    }
    tmr.Stop();

    // I removed the bracketed values from the table above; they just make the
    // compiler assume I am using the value for something, so it doesn't optimize
    // the loop away.
    cout << typestring << " add/sub: " << tmr.Elapsed() * 1000 << "ms [" << (int)v << "]" << endl;

    tmr.Start();
    for (int i = 0 ; i < 100000000 ; ++i)
    {
        v /= v0;
        v *= v1;
        v /= v2;
        v *= v3;
        v /= v4;
        v *= v5;
        v /= v6;
        v *= v7;
        v /= v8;
        v *= v9;
    }
    tmr.Stop();

    cout << typestring << " div/mul: " << tmr.Elapsed() * 1000 << "ms [" << (int)v << "]" << endl;
}
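
(HTimer is my own thin wrapper around QueryPerformanceCounter. For reference, it behaves roughly like the sketch below; the names and layout are illustrative, not the exact class.)

#include <windows.h>

// Illustrative sketch of the HTimer wrapper used above: Start/Stop capture
// QueryPerformanceCounter ticks and Elapsed() converts them to seconds.
class HTimer
{
    LARGE_INTEGER freq, t0, t1;
public:
    HTimer()     { QueryPerformanceFrequency(&freq); t0.QuadPart = t1.QuadPart = 0; }
    void Start() { QueryPerformanceCounter(&t0); }
    void Stop()  { QueryPerformanceCounter(&t1); }
    double Elapsed() const   // seconds; multiplied by 1000 at the call site
    {
        return static_cast<double>(t1.QuadPart - t0.QuadPart) /
               static_cast<double>(freq.QuadPart);
    }
};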

The code for the C# tests is not generic, and is implemented thus:

static double DoubleTest()
{
    Random rnd = new Random();
    Stopwatch sw = new Stopwatch();

    double v = 0;
    double v0 = (double)rnd.Next(1, int.MaxValue);
    double v1 = (double)rnd.Next(1, int.MaxValue);
    double v2 = (double)rnd.Next(1, int.MaxValue);
    double v3 = (double)rnd.Next(1, int.MaxValue);
    double v4 = (double)rnd.Next(1, int.MaxValue);
    double v5 = (double)rnd.Next(1, int.MaxValue);
    double v6 = (double)rnd.Next(1, int.MaxValue);
    double v7 = (double)rnd.Next(1, int.MaxValue);
    double v8 = (double)rnd.Next(1, int.MaxValue);
    double v9 = (double)rnd.Next(1, int.MaxValue);

    sw.Start();
    for (int i = 0; i < 100000000; i++)
    {
        v += v0;
        v -= v1;
        v += v2;
        v -= v3;
        v += v4;
        v -= v5;
        v += v6;
        v -= v7;
        v += v8;
        v -= v9;
    }
    sw.Stop();

    Console.WriteLine("double add/sub: {0}", sw.ElapsedMilliseconds);
    sw.Reset();

    sw.Start();
    for (int i = 0; i < 100000000; i++)
    {
        v /= v0;
        v *= v1;
        v /= v2;
        v *= v3;
        v /= v4;
        v *= v5;
        v /= v6;
        v *= v7;
        v /= v8;
        v *= v9;
    }
    sw.Stop();

    Console.WriteLine("double div/mul: {0}", sw.ElapsedMilliseconds);
    sw.Reset();

    return v;
}

Any ideas here?

– Chris D.
  • What optimization settings are you using in C++? Are you running this inside Visual Studio's test host (the timings seem slow to me...)? – Reed Copsey Jul 20 '10 at 18:00
  • They seem slow to me also. I am running it from the commandline, Full Program Optimization, Optimize for Speed. 32bit Windows XP. – Chris D. Jul 20 '10 at 18:08
  • I'm getting really weird results when I try to run your benchmarks; it seems utterly random. One time I get ~200 ms for `double` and `float` division, the next time it can be as much as ~7000 ms. I ran it with 10 times fewer iterations, otherwise it would take too long when it does spaz out. That is on the C# side; on the C++ side I'm seeing that `float` add/sub is 3x slower than C#, and the division is consistently slow at 7+ seconds. – JulianR Jul 20 '10 at 22:28
  • @JulianR: This is why I posted this question here; I don't quite understand these strange performance differences. – Chris D. Jul 21 '10 at 12:20

5 Answers


For the float div/mul tests, you're probably getting denormalized values, which are much slower to process than normal floating point values. This isn't an issue for the int tests and would crop up much later for the double tests.

You should be able to add this to the start of the C++ code to flush denormals to zero:

_controlfp(_DN_FLUSH, _MCW_DN);
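
For example, a minimal sketch of where that call would go, assuming MSVC (_controlfp and the _DN_FLUSH/_MCW_DN constants come from <float.h>):

#include <float.h>  // _controlfp, _DN_FLUSH, _MCW_DN (MSVC-specific)

int main()
{
    // Flush denormal results to zero for all subsequent floating point
    // operations on this thread, before any benchmark runs.
    _controlfp(_DN_FLUSH, _MCW_DN);

    GenericTest<float>("float");
    GenericTest<double>("double");
    return 0;
}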

I'm not sure how to do it in C# though (or if it's even possible).

Some more info here: Floating Point Math Execution Time

– celion
  • This solved it. Adding that line to the GenericTest function caused it to execute much more reasonably. 1sec for float adds, 1.3sec for float mul/divs, 1sec for double adds, 1.6sec for double mul/divs. – Chris D. Jul 21 '10 at 12:18
  • I'm still not sure that it's not a flawed benchmark, but at least now it's *less* flawed :) – celion Jul 21 '10 at 19:47

It's possible that the C# JIT optimized the division by vx into multiplication by 1 / vx, since it knows those values aren't modified during the loop and it can compute the inverses just once up front.

You can do this optimization yourself and time it in C++.
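
Something along these lines, for the float/double instantiations (a sketch of the transformation; note that x * (1/v) is not bit-identical to x / v in IEEE arithmetic, which is why a compiler normally needs permission to do this):

// Hand-hoisted reciprocals; only meaningful for floating point types,
// since an integer reciprocal would just truncate to zero.
T inv0 = (T)1 / v0;
T inv2 = (T)1 / v2;
T inv4 = (T)1 / v4;
T inv6 = (T)1 / v6;
T inv8 = (T)1 / v8;

tmr.Start();
for (int i = 0; i < 100000000; ++i)
{
    v *= inv0;  // was: v /= v0;
    v *= v1;
    v *= inv2;  // was: v /= v2;
    v *= v3;
    v *= inv4;  // was: v /= v4;
    v *= v5;
    v *= inv6;  // was: v /= v6;
    v *= v7;
    v *= inv8;  // was: v /= v8;
    v *= v9;
}
tmr.Stop();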

– Mark B

If you're interested in floating point speed and possible optimizations, read this book: http://www.agner.org/optimize/optimizing_cpp.pdf

You can also check this: http://msdn.microsoft.com/en-us/library/aa289157%28VS.71%29.aspx

Your results could depend on things such as the JIT, and on compilation flags (debug/release, what kind of FP optimizations are performed, or the allowed instruction set).

Try setting these flags to maximum optimization, and change your program so that it surely won't produce overflows or NaNs, because they affect computation speed. (Even something like "v += v1; v += v2; v -= v1; v -= v2;" is OK, because it won't be reduced away in "strict" or "precise" floating point mode.) Also try not to use more variables than you have FP registers.
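
For example, a loop body like this stays bounded, so it can never overflow, produce a NaN, or drift into denormals, yet under precise/strict FP semantics (MSVC's /fp:precise, the default, or /fp:strict) the compiler is not allowed to cancel the operations away:

// Bounded benchmark body: each iteration adds and then subtracts the same
// values, so v never grows, but precise/strict floating point modes forbid
// the compiler from folding these operations away.
for (int i = 0; i < 100000000; ++i)
{
    v += v1;
    v += v2;
    v -= v1;
    v -= v2;
}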

– ruslik

Multiplication isn't bad. I think it is a few cycles slower than addition, but yes, division is very slow compared to the others. It takes significantly longer, and unlike the other three operations, it is not pipelined.

– jalf

I also thought your C++ times looked incredibly slow, so I ran it myself. Turns out that actually, you're totally wrong.

[screenshot: release-mode timings, essentially 0 ns]

I replaced your timer (I've no idea what timer you were using, but I don't have one handy) with the Windows High-Performance Timer. That thing can do nanoseconds or better. Guess what? Visual Studio says no. I didn't even tweak it for the highest performance. VS can see right through this sort of crap and elided all of the loops. That's why you should never, ever use this sort of "profiling". Get a professional profiler and come back. Unless 2010 Express is different from 2010 Professional, which I doubt; they mainly differ in IDE features, not raw code performance/optimization.
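
(If you must write micro-benchmarks like this, at least make the result observable so the optimizer can't discard the loop. A volatile sink is the usual trick; a sketch:)

volatile double sink;  // writing here is an observable side effect

void Bench()
{
    double v = 0.0;
    for (int i = 0; i < 100000000; ++i)
        v += 1.000001;  // work the compiler would otherwise delete
    sink = v;           // consuming v forces the loop to actually execute
}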

I'm not even going to bother running your C#.

Edit: This is DEBUG x64 (the previous screenshot is x86, but I thought I'd do x64 since I am on x64), and I also fixed a minor bug that caused the time to be negative rather than positive. So unless you want to tell me that your release FP on 32-bit is a hundred times slower, I think you've screwed up.

[screenshot: x64 debug timings]

One thing I did find curious is that the x86 debug program never terminated on the second float test: if you did float first, then double, it was the double div/mul that failed; if you did double then float, the float div/mul failed. Must be a compiler glitch.

– Puppy
  • Uhm, 0 nanoseconds? Are you using some NASA PC? – RvdK Jul 20 '10 at 22:29
  • @PoweRoy: Read the post. The point is that the compiler optimized all of it away. – Puppy Jul 20 '10 at 22:33
  • @SoapBox: The OP already posted my code. Frankly, I just couldn't be bothered to type the results out. – Puppy Jul 20 '10 at 23:05
  • @DeadMG How come the first and second timings are different? First one is only showing 0 (and you don't explain why). – RvdK Jul 21 '10 at 07:31
  • @PoweRoy: ... Perhaps you could read the post. The second run was in DEBUG mode, i.e., all compiler optimizations disabled and additional overhead for debugging, and it's still a hundred times faster than the OP's time. – Puppy Jul 21 '10 at 08:58
  • I am using the High-Performance Timer, QueryPerformanceCounter in both cases. I have never had the compiler optimize the loops away, what compiler options did you set? – Chris D. Jul 21 '10 at 12:09
  • @Chris D: All I did was change it to release. That was it. Didn't go through every possible optimization option or even set overall optimization level. – Puppy Jul 21 '10 at 16:05