Performance of pow(x,3.0f) vs x*x*x?

Question

The following program...

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}

...takes about 900ms to complete on my machine. Whereas...

#include <cmath>

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += std::pow(x,3.0f);
    }
    return t;
}

...takes about 6600ms to complete.

I'm kind of surprised that the optimizer doesn't inline the std::pow function so that the two programs produce the same code and have identical performance.

Any insights? How do you account for the 5x performance difference?

For reference I'm using gcc -O3 on Linux x86
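For anyone reproducing these numbers, here is a minimal timing sketch. It assumes `std::chrono::steady_clock` is an acceptable clock; the helper names `time_ms`, `sum_cubes_mul` and `sum_cubes_pow` are hypothetical and not part of the programs above. It times only the loop, not process startup.

```cpp
#include <chrono>
#include <cmath>

// Hypothetical helper: runs f() once and returns the elapsed wall time in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Sum of i^3 for i in [0, n), using two multiplications per iteration.
float sum_cubes_mul(int n) {
    float t = 0;
    for (int i = 0; i < n; i++) {
        const float x = i;
        t += x * x * x;
    }
    return t;
}

// Same sum, but going through std::pow with a float exponent.
float sum_cubes_pow(int n) {
    float t = 0;
    for (int i = 0; i < n; i++) {
        const float x = i;
        t += std::pow(x, 3.0f);
    }
    return t;
}
```

A call such as `time_ms([&] { t = sum_cubes_pow(n); })`, with `n` read from the command line and `t` printed afterwards, should keep the optimizer from folding or discarding the loop.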

Update: (C Version)

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}

...takes about 900ms to complete on my machine. Whereas...

#include <math.h>

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += powf(x,3.0f);
    }
    return t;
}

...takes about 6600ms to complete.

Update 2

The following program:

#include <math.h>

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += __builtin_powif(x,3.0f);
    }
    return t;
}

runs in 900ms like the first program.

Why isn't pow being inlined to __builtin_powif ?

Update 3:

With -ffast-math the following program:

#include <math.h>
#include <iostream>

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
            const float x = i;
            t += powf(x, 3.0f);
    }
    std::cout << t;
}

runs in 227ms (as does the x*x*x version). That's roughly 200 picoseconds per iteration. With -fopt-info, GCC reports optimized: loop vectorized using 16 byte vectors and optimized: loop with 2 iterations completely unrolled, so I guess that means it's doing iterations in batches of 4 with SSE and pipelining 2 such batches at once (for a total of 8 iterations at once), or something like that?
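One way to picture "vectorized using 16 byte vectors" is the hand-written sketch below. It is an illustration only, not what GCC actually emits. The function names are hypothetical, and the four `lane` accumulators stand in for the four floats of one SSE register. Summing in lanes changes the order of the float additions, which is presumably why GCC only does this for a float reduction under -ffast-math.

```cpp
// Straightforward scalar reduction: one running sum, strict left-to-right order.
float sum_cubes_scalar(int n) {
    float t = 0;
    for (int i = 0; i < n; i++) {
        const float x = i;
        t += x * x * x;
    }
    return t;
}

// Illustrative 4-lane version: four independent partial sums, combined at the end.
float sum_cubes_4lanes(int n) {
    float lane[4] = {0, 0, 0, 0};
    int i = 0;
    // Main loop: handles four consecutive iterations per pass.
    for (; i + 4 <= n; i += 4) {
        for (int k = 0; k < 4; k++) {
            const float x = i + k;
            lane[k] += x * x * x;
        }
    }
    // Horizontal add of the four partial sums.
    float t = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    // Remainder loop: the last n % 4 iterations, done one at a time.
    for (; i < n; i++) {
        const float x = i;
        t += x * x * x;
    }
    return t;
}
```

For small `n` every partial sum is an exactly representable integer, so both functions agree. For large `n` the two orders of summation can round differently and give different totals.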

Your call to `std::pow` uses a floating point value as exponent (you can calculate roots with it), while `x*x*x` is way more simple. — Simon Kraemer, Feb 11 '21 at 07:46
As you should know, please don't tag both C and C++ unless the question really is about both languages (translation between them or similar). Your code is C++ specific and can't be built with a C compiler. — Some programmer dude, Feb 11 '21 at 07:46
And what happens if you use the correct function `powf` instead of the double precision one? — Lundin, Feb 11 '21 at 07:51
Also, isn't this calculation going to overflow in terrible ways? — Lundin, Feb 11 '21 at 07:51
You might get a more reliable measurement if you call `pow` in advance of the loop, then time just the loop part. — Bathsheba, Feb 11 '21 at 07:52
@klutt: I removed the comment - I had miscounted the number of zeros in the long constants. — Bathsheba, Feb 11 '21 at 07:55
Since you're using GCC, try `t += __builtin_powif(x, 3)` -- the results should be illuminating. — Travis Gockel, Feb 11 '21 at 07:56
@TravisGockel: `__builtin_powif(x, 3.0f)` does indeed produce the 900ms result. Why isn't `pow` being inlined to `__builtin_powif(x, 3.0f)`? — Andrew Tomazos, Feb 11 '21 at 07:58
I tend to agree that GCC should be smart enough to figure out that `std::pow` is the same thing...what version of GCC are you using? — Travis Gockel, Feb 11 '21 at 08:00
Add `-fopt-info` and it will tell you that it vectorized the loop. The final result is likely different though. — Marc Glisse, Feb 11 '21 at 08:28
@MarcGlisse: I changed the loop bound to be dynamic based on `stoi(argv[1])`, still performed 227ms. I then added `-fopt-info` and indeed it says: `optimized: loop vectorized using 16 byte vectors`. So it must be SSE vectorizing 4 iterations at once. This explains the 200 picosecond iteration time which is below one clock cycle. It still seems insanely fast, even considering that. — Andrew Tomazos, Feb 11 '21 at 08:31
@MarcGlisse: It also says `optimized: loop with 2 iterations completely unrolled` so I guess that means it's doing two iterations at once for a total of 8 at once maybe. — Andrew Tomazos, Feb 11 '21 at 08:33
The unrolling is probably just the last few iterations (in case it isn't a multiple of 4). — Marc Glisse, Feb 11 '21 at 08:42
`-march=native` would likely give you AVX vectorization, unless your processor is very old. — Marc Glisse, Feb 11 '21 at 08:45

score 3 · Accepted Answer · answered Feb 11 '21 at 08:06

The doc page about gcc builtins is explicit (emphasis mine):

Built-in Function: double __builtin_powi (double, int)

Returns the first argument raised to the power of the second. Unlike the pow function no guarantees about precision and rounding are made.

Built-in Function: float __builtin_powif (float, int)

Similar to __builtin_powi, except the argument and return types are float.

As __builtin_powif performs on par with a plain product, the additional time must be spent on the work that pow does to uphold its guarantees about precision and rounding.
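A rough way to check whether the two approaches really return different values is sketched below. The helper names are hypothetical, and the outcome depends on the math library and compiler flags; it should be built without -ffast-math so that the call to `std::pow` is not rewritten into multiplications. The idea is that `x*x*x` rounds after each multiplication, whereas a careful `powf` tries to return a result close to the correctly rounded cube.

```cpp
#include <cmath>

// Cube by two float multiplications (the result is rounded after each one).
float cube_mul(float x) { return x * x * x; }

// Cube through the library power function with a float exponent.
float cube_pow(float x) { return std::pow(x, 3.0f); }

// Counts the integers i in [1, n] for which the two cubes compare unequal as floats.
int count_mismatches(int n) {
    int mismatches = 0;
    for (int i = 1; i <= n; i++) {
        const float x = i;
        if (cube_mul(x) != cube_pow(x)) mismatches++;
    }
    return mismatches;
}
```

If `count_mismatches` returns a nonzero value for a large `n`, the compiler cannot substitute one form for the other without changing results, which is consistent with the substitution only happening under relaxed floating-point rules.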

Serge Ballesta

Ok, so `x*x*x` and `powf(x,3)` will produce a different result for some values of x? That makes sense then I guess, if powf is more precise. – Andrew Tomazos Feb 11 '21 at 08:08
@AndrewTomazos: I would assume yes. The only real way to go further is to examine the source code of `pow`/`powf` in gcc library. – Serge Ballesta Feb 11 '21 at 08:11
Interestingly, this behavior changed in C++11. The libstdc++ definition of `std::pow` which uses `__builtin_powif` only happens in `#if __cplusplus < 201103L`. – Travis Gockel Feb 11 '21 at 08:12
@TravisGockel I believe that's due to [`std::pow` (7)](https://en.cppreference.com/w/cpp/numeric/math/pow) *"If any argument has integral type, it is cast to `double`"*. [Here](https://timsong-cpp.github.io/cppwp/cmath.syn) only the floating point overloads are listed. Also: https://stackoverflow.com/a/6626234/4944425 – Bob__ Feb 11 '21 at 10:00

Hanjoung Lee · Answer 2 · 2021-02-11T14:51:05.973

0

Note: this assumes your compiler chose to just call pow in the shared library, as in https://godbolt.org/z/re3baK (without -ffast-math).

I did not take a look at how pow(float, float) is implemented, but I see some points.

- x*x*x is inlined, while pow can't be, as it is in a shared library: a difference in function call overhead.
- Is the exponent 3.0 a constant? If the compiler knows something is constant, it is likely to generate more efficient code.
  - x*x*x: just generates assembly for two float multiplications.
  - pow: this must handle all exponent values, so it probably has general code (less efficient, may include loops).
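To illustrate the contrast between fixed and general exponents, here is a sketch of an integer-power routine using repeated squaring. It is a hypothetical helper, not the actual code of GCC's builtin or of any libm. For a fixed exponent of 3 the loop reduces to two multiplications, whereas a `pow` that accepts arbitrary real exponents cannot use this trick and typically needs logarithm/exponential-style computation.

```cpp
// Hypothetical helper: x raised to an integer power n by square-and-multiply.
// No precision or rounding guarantees beyond those of the individual multiplications.
float powi_sketch(float x, int n) {
    // Work with the magnitude of n as unsigned so that INT_MIN is handled too.
    unsigned int e = n < 0 ? 0u - static_cast<unsigned int>(n)
                           : static_cast<unsigned int>(n);
    float result = 1.0f;
    while (e != 0) {
        if (e & 1u) result *= x;  // this bit of the exponent is set
        e >>= 1;
        if (e != 0) x *= x;       // square the base for the next bit
    }
    // A negative exponent means the reciprocal of the positive power.
    return n < 0 ? 1.0f / result : result;
}
```

With `n == 3` the loop computes `x * (x * x)`, which is the same pair of multiplications as `x*x*x`.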

edited Feb 11 '21 at 14:51

answered Feb 11 '21 at 08:09

The compiler can inline or optimize `pow` because the C standard specifies how it behaves, and the external identifier `pow` is reserved for that purpose, so the compiler knows what it is required to do. Further, compilers and libraries may be bundled as part of one C implementation, so the implementation would have complete control over both the code generated by the compiler and the implementation of `pow`. – Eric Postpischil Feb 11 '21 at 11:53
Well, I was answering the question "why is there a 5x difference?". And with `-O3` (without `-ffast-math`) it really calls the `pow` function and it is slow. https://godbolt.org/z/Kx71W7 Let me know if I missed something. – Hanjoung Lee Feb 11 '21 at 12:07
If a particular compiler does not inline `pow` in a particular circumstance, then the reason is not that it cannot be inlined because `pow` is in a shared library; the reason is that the compiler’s design does not inline pow. This can be complicated by other considerations, but it is not a requirement of the C standard. – Eric Postpischil Feb 11 '21 at 12:14
I get your point, and I'm not saying what I said is part of the standard. I just wanted to analyze the significant performance difference the OP asked about. – Hanjoung Lee Feb 11 '21 at 14:35
The point is that the statement “`pow` can’t be as it is in a shared library” is false. Regardless of what you intended to analyze or convey, this answer presents false information and is therefore detrimental. In fact, [GCC and Clang do replace some calls to `pow` with inline code](https://godbolt.org/z/7PMEsq), including `pow(x, 3)` if `-ffast-math` is enabled. The second item, about constants, also misstates the situation; the code in the question has a hardcoded `3.0f`, so it is a constant; the text in this answer about having “general code” or considering all exponent values is inapplicable. – Eric Postpischil Feb 11 '21 at 14:43
Oh I get that and updated the answer. Thanks. – Hanjoung Lee Feb 11 '21 at 14:53
