The following program...
int main() {
float t = 0;
for (int i = 0; i < 1'000'000'000; i++) {
const float x = i;
t += x*x*x;
}
return t;
}
...takes about 900ms to complete on my machine. Whereas...
#include <cmath>
int main() {
float t = 0;
for (int i = 0; i < 1'000'000'000; i++) {
const float x = i;
t += std::pow(x,3.0f);
}
return t;
}
...takes about 6600ms to complete.
I'm kind of surprised that the optimizer doesn't inline the std::pow call so that the two programs produce the same code and have identical performance.
Any insights? How do you account for the roughly 7x performance difference?
For reference, I'm using gcc -O3 on Linux x86.
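For anyone who wants to reproduce the comparison in-process rather than with an external timer, here is a minimal sketch. The helper names `elapsed_ms` and `sum_cubes` are hypothetical (not from any library), and the iteration count is made a parameter so the loop can be run at smaller sizes.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in milliseconds, measured with a monotonic clock.
template <class Work>
double elapsed_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// The loop from the first program, with the iteration count as a parameter.
inline float sum_cubes(int n) {
    float t = 0;
    for (int i = 0; i < n; i++) {
        const float x = static_cast<float>(i);
        t += x * x * x;
    }
    return t;
}
```

Calling `elapsed_ms([&] { result = sum_cubes(1'000'000'000); })` and then printing `result` keeps the compiler from discarding the loop as dead code.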
Update: (C Version)
int main() {
float t = 0;
for (int i = 0; i < 1000000000; i++) {
const float x = i;
t += x*x*x;
}
return t;
}
...takes about 900ms to complete on my machine. Whereas...
#include <math.h>
int main() {
float t = 0;
for (int i = 0; i < 1000000000; i++) {
const float x = i;
t += powf(x,3.0f);
}
return t;
}
...takes about 6600ms to complete.
Update 2:
The following program:
#include <math.h>
int main() {
float t = 0;
for (int i = 0; i < 1000000000; i++) {
const float x = i;
t += __builtin_powif(x,3.0f);
}
return t;
}
runs in 900ms, like the first program.
Why isn't pow being inlined to __builtin_powif?
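One detail that may matter here: as far as I can tell from the GCC documentation, `__builtin_powif` is declared as `float __builtin_powif(float, int)`, so the `3.0f` in the snippet above is converted to the integer 3 before the call. The following is a minimal sketch, assuming GCC or Clang and using hypothetical helper names, of why the two calls are not interchangeable for a general exponent.

```cpp
#include <cmath>

// Hypothetical helpers for illustration only (names are not from any library).

// Integer-exponent power: the float exponent is truncated to int first,
// since __builtin_powif takes (float, int) on GCC and Clang.
inline float pow_int_exponent(float base, float exponent) {
    return __builtin_powif(base, static_cast<int>(exponent));
}

// General power: the exponent stays a float, so fractional
// exponents such as 2.5f are honoured.
inline float pow_float_exponent(float base, float exponent) {
    return std::pow(base, exponent);  // float overload of std::pow
}
```

With an exponent of 3.0f both helpers compute a cube, but `pow_int_exponent(2.0f, 2.5f)` evaluates 2 squared, while `pow_float_exponent(2.0f, 2.5f)` is about 5.66.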
Update 3:
With -ffast-math, the following program:
#include <math.h>
#include <iostream>
int main() {
float t = 0;
for (int i = 0; i < 1'000'000'000; i++) {
const float x = i;
t += powf(x, 3.0f);
}
std::cout << t;
}
runs in 227ms (as does the x*x*x version). That's roughly 230 picoseconds per iteration (227ms over 10^9 iterations). With -fopt-info, gcc reports "optimized: loop vectorized using 16 byte vectors" and "optimized: loop with 2 iterations completely unrolled", so I guess that means it's doing iterations in batches of 4 for SSE and doing 2 iterations at once through pipelining (for a total of 8 iterations at once), or something like that?
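To make that guess concrete, here is a hand-written sketch of what "8 iterations at once" would look like in plain C++: eight independent partial sums (4 lanes times 2 unrolled steps) that are combined at the end. This only illustrates the interpretation above and is not a claim about the code gcc actually emitted; the function names are hypothetical. Because float addition is not associative, this reordering can change the rounded result, which is presumably why the compiler only does it under -ffast-math.

```cpp
// Hypothetical illustration: sum of cubes of 0..n-1 in the original order.
inline float sum_cubes_scalar(int n) {
    float t = 0;
    for (int i = 0; i < n; i++) {
        const float x = static_cast<float>(i);
        t += x * x * x;
    }
    return t;
}

// Hypothetical illustration: the same sum using 8 independent accumulators.
inline float sum_cubes_8way(int n) {
    float acc[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    int i = 0;
    // Main loop: 8 iterations per pass, each feeding its own accumulator,
    // so no addition has to wait for the previous one to finish.
    for (; i + 8 <= n; i += 8) {
        for (int lane = 0; lane < 8; ++lane) {
            const float x = static_cast<float>(i + lane);
            acc[lane] += x * x * x;
        }
    }
    // Combine the partial sums. This changes the order of the additions
    // compared with the scalar loop.
    float total = 0;
    for (int lane = 0; lane < 8; ++lane) {
        total += acc[lane];
    }
    // Tail: leftover iterations when n is not a multiple of 8.
    for (; i < n; ++i) {
        const float x = static_cast<float>(i);
        total += x * x * x;
    }
    return total;
}
```

For small n, where every intermediate value is exactly representable as a float, the two functions agree; for large n they can differ in the low bits because of the reordered additions.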