
The following program...

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}

...takes about 900ms to complete on my machine. Whereas...

#include <cmath>

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += std::pow(x,3.0f);
    }
    return t;
}

...takes about 6600ms to complete.

I'm kind of surprised that the optimizer doesn't inline the std::pow call so that the two programs produce the same code and have identical performance.

Any insights? How do you account for the 5x performance difference?

For reference, I'm using gcc -O3 on Linux x86.
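
In case it helps reproduce the numbers, this is roughly the timing harness I'd use to compare both loops in one run (a sketch: time_loop is just a helper name I made up, wall-clock numbers vary by machine, and the lambda indirection could in principle affect inlining, so the standalone programs above remain the reference measurement):

// Minimal timing harness (a sketch, not the original benchmark).
#include <chrono>
#include <cmath>
#include <cstdio>

template <class F>
static float time_loop(const char* label, F body) {
    const auto start = std::chrono::steady_clock::now();
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += body(x);   // keep accumulating so the loop can't be optimized away
    }
    const auto stop = std::chrono::steady_clock::now();
    const long long ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::printf("%-8s %lld ms (t=%g)\n", label, ms, t);
    return t;
}

int main() {
    time_loop("cubed", [](float x) { return x * x * x; });
    time_loop("pow", [](float x) { return std::pow(x, 3.0f); });
}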

Update: (C Version)

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}

...takes about 900ms to complete on my machine. Whereas...

#include <math.h>

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += powf(x,3.0f);
    }
    return t;
}

...takes about 6600ms to complete.

Update 2:

The following program:

#include <math.h>

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += __builtin_powif(x,3.0f);
    }
    return t;
}

runs in 900ms like the first program.

Why isn't pow being inlined to __builtin_powif?

Update 3:

With -ffast-math the following program:

#include <math.h>
#include <iostream>

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
            const float x = i;
            t += powf(x, 3.0f);
    }
    std::cout << t;
}

runs in 227ms (as does the x*x*x version). That's about 0.23 nanoseconds per iteration. With -fopt-info it reports `optimized: loop vectorized using 16 byte vectors` and `optimized: loop with 2 iterations completely unrolled`, so I guess that means it's doing iterations in batches of 4 for SSE and pipelining 2 of those batches at once (for a total of 8 iterations per pass), or something like that?
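
For what it's worth, here's a scalar sketch of what I think those two messages describe — not the actual generated code, just the shape: 4 floats per 16-byte vector, and two such vector iterations unrolled back to back, so 8 elements per pass, accumulated into separate partial sums (which -ffast-math permits, since it allows the sum to be reassociated):

// Scalar illustration (mine, not the emitted code) of "16 byte vectors" plus
// "2 iterations completely unrolled": 4 lanes per vector, two vector
// iterations per unrolled step, so 8 elements per pass, accumulated into two
// partial sums that are only combined at the end.
#include <cstdio>

int main() {
    float sum0[4] = {0, 0, 0, 0};   // stands in for one SSE accumulator
    float sum1[4] = {0, 0, 0, 0};   // second accumulator from the unrolled copy
    for (int i = 0; i < 1'000'000'000; i += 8) {
        for (int lane = 0; lane < 4; lane++) {
            const float a = (float)(i + lane);
            const float b = (float)(i + 4 + lane);
            sum0[lane] += a * a * a;
            sum1[lane] += b * b * b;
        }
    }
    float t = 0;
    for (int lane = 0; lane < 4; lane++)
        t += sum0[lane] + sum1[lane];
    std::printf("%g\n", t);
}

That reassociation is also why the -ffast-math result can differ slightly from the strict left-to-right sum.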

Andrew Tomazos

2 Answers

3

The GCC documentation page about built-in functions is explicit (emphasis mine):

Built-in Function: double __builtin_powi (double, int)

Returns the first argument raised to the power of the second. Unlike the pow function *no guarantees about precision and rounding are made*.

Built-in Function: float __builtin_powif (float, int)

Similar to __builtin_powi, except the argument and return types are float.

Since __builtin_powif performs the same as a plain product, the extra time must be spent on the checks that pow needs to meet its guarantees about precision and rounding.
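
A quick way to see this in practice (a check of my own, not taken from the pow sources) is to scan for inputs where the two forms round differently; which values disagree, if any, depends on the libm implementation in use:

// Scan for inputs where x*x*x and pow(x, 3.0f) round differently.
// Which x values (if any) disagree depends on the libm implementation,
// so treat the output as illustrative rather than definitive.
#include <cmath>
#include <cstdio>

int main() {
    int mismatches = 0;
    for (int i = 1; i < 1000000 && mismatches < 5; i++) {
        const float x = (float)i;
        const float a = x * x * x;          // two separately rounded multiplies
        const float b = std::pow(x, 3.0f);  // library call with its own accuracy goals
        if (a != b) {
            std::printf("x=%d  x*x*x=%.9g  pow=%.9g\n", i, (double)a, (double)b);
            ++mismatches;
        }
    }
    if (mismatches == 0)
        std::printf("no differences found in this range\n");
}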

Serge Ballesta
  • Ok, so `x*x*x` and `powf(x,3)` will produce a different result for some values of x? That makes sense then I guess, if powf is more precise. – Andrew Tomazos Feb 11 '21 at 08:08
  • @AndrewTomazos: I would assume yes. The only real way to go further is to examine the source code of `pow`/`powf` in gcc library. – Serge Ballesta Feb 11 '21 at 08:11
  • Interestingly, this behavior changed in C++11. The libstdc++ definition of `std::pow` which uses `__builtin_powif` only happens in `#if __cplusplus < 201103L`. – Travis Gockel Feb 11 '21 at 08:12
  • @TravisGockel I believe that's due to [`std::pow` (7)](https://en.cppreference.com/w/cpp/numeric/math/pow) *"If any argument has integral type, it is cast to `double`"*. [Here](https://timsong-cpp.github.io/cppwp/cmath.syn) only the floating point overloads are listed. Also: https://stackoverflow.com/a/6626234/4944425 – Bob__ Feb 11 '21 at 10:00
0

(Assuming your compiler chose to just call pow in the shared library, as in https://godbolt.org/z/re3baK, i.e. without -ffast-math.)

I haven't looked at how pow(float, float) is implemented, but a couple of points stand out.

  1. x*x*x is inlined, while pow can't be since it is in a shared library, so there is a function-call overhead difference.
  2. Whether the exponent 3.0 is known to be constant. If the compiler knows something is constant, it is likely to generate more efficient code.
    • x*x*x: just generates assembly for two float multiplications.
    • pow: has to account for arbitrary exponent values, so it presumably contains general-purpose code (less efficient, possibly with loops); see the sketch below.
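
To make the second point concrete, here is a rough sketch of my own (not GCC's or the libm's actual code): a known small integer exponent boils down to a few multiplications, while a general pow(float, float) has to handle arbitrary exponents, conceptually via exp/log, with real implementations doing much more work for accuracy and special cases.

// My own illustration, not GCC's or glibc's implementation.
#include <cmath>
#include <cstdio>

// Integer exponent: a handful of multiplications (the kind of expansion
// __builtin_powi-style code boils down to), no accuracy machinery needed.
static float powi_sketch(float base, int exp) {
    float result = 1.0f;
    for (; exp > 0; exp >>= 1) {   // exponentiation by squaring
        if (exp & 1)
            result *= base;
        base *= base;
    }
    return result;
}

// Arbitrary exponent: conceptually exp(y * log(x)). Real libm code is far
// more careful (extended precision, special cases, accuracy goals), which
// is where the extra time goes.
static float pow_general_sketch(float x, float y) {
    return std::exp(y * std::log(x));
}

int main() {
    std::printf("%g %g %g\n",
                powi_sketch(5.0f, 3),
                pow_general_sketch(5.0f, 3.0f),
                std::pow(5.0f, 3.0f));
}
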
Hanjoung Lee
  • The compiler can inline or optimize `pow` because the C standard specifies how it behaves, and the external identifier `pow` is reserved for that purpose, so the compiler knows what it is required to do. Further, compilers and libraries may be bundled as part of one C implementation, so the implementation would have complete control over both the code generated by the compiler and the implementation of `pow`. – Eric Postpischil Feb 11 '21 at 11:53
  • Well, I was answering the question "why is there a 5x difference?". And with `-O3` (without `-ffast-math`) it really calls the `pow` function and it is slow. https://godbolt.org/z/Kx71W7 Let me know if I missed something. – Hanjoung Lee Feb 11 '21 at 12:07
  • If a particular compiler does not inline `pow` in a particular circumstance, then the reason is not that it cannot be inlined because `pow` is in a shared library; the reason is that the compiler’s design does not inline `pow`. This can be complicated by other considerations, but it is not a requirement of the C standard. – Eric Postpischil Feb 11 '21 at 12:14
  • I get your point, and I'm not saying what I said is part of the standard. I just wanted to analyze the significant performance difference the OP asked about. – Hanjoung Lee Feb 11 '21 at 14:35
  • The point is that the statement “`pow` can’t be as it is in a shared library” is false. Regardless of what you intended to analyze or convey, this answer presents false information and is therefore detrimental. In fact, [GCC and Clang do replace some calls to `pow` with inline code](https://godbolt.org/z/7PMEsq), including `pow(x, 3)` if `-ffast-math` is enabled. The second item, about constants, also misstates the situation; the code in the question has a hardcoded `3.0f`, so it is a constant; the text in this answer about having “general code” or considering all exponent values is inapplicable. – Eric Postpischil Feb 11 '21 at 14:43
  • Oh I get that and updated the answer. Thanks. – Hanjoung Lee Feb 11 '21 at 14:53