The relevant part of the source (from a Godbolt link in a comment, which you should really edit into your question) is:

```cpp
const auto cnt = std::count_if(lookups.begin(), lookups.end(), [](const auto& val){
    return buckets[hash_val(val) % 16] == val;
});
```
I didn't check the libstdc++ headers to see if `count_if` is implemented with an `if() { count++; }`, or if it uses a ternary to encourage branchless code. Probably a conditional. (The compiler can choose either, but a ternary is more likely to compile to a branchless `cmovcc` or `setcc`.)
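As a sketch of the two possibilities (hand-rolled equivalents I made up for illustration, not the actual libstdc++ implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Conditional-increment form: the predicate result likely becomes a branch
// (jcc over the increment), leaving if-conversion up to the optimizer.
template <class It, class Pred>
std::size_t count_if_branchy(It first, It last, Pred p) {
    std::size_t n = 0;
    for (; first != last; ++first)
        if (p(*first))
            ++n;
    return n;
}

// Ternary form: an obvious lowering is setcc (or cmovcc) feeding an add,
// i.e. branchless by construction.
template <class It, class Pred>
std::size_t count_if_ternary(It first, It last, Pred p) {
    std::size_t n = 0;
    for (; first != last; ++first)
        n += p(*first) ? 1 : 0;
    return n;
}
```

With a predicate as cheap as an integer compare, the ternary form hands the compiler an obvious `setcc` / `add` lowering, while the `if` form leaves the branchy-vs-branchless decision entirely to the if-conversion heuristics.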
It looks like gcc overestimated the cost of branchless for this code with generic tuning. `-mtune=skylake` (implied by `-march=skylake`) gives us branchless code for this regardless of `-O2` vs. `-O3`, or `-fno-tree-vectorize` vs. `-ftree-vectorize`. (On the Godbolt compiler explorer, I also put the count in a separate function that counts a `vector<int>&`, so we don't have to wade through the timing and `cout` code-gen in `main`.)
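For reference, that helper function was shaped roughly like this (a reconstruction with stand-in definitions for the question's `buckets` and `hash_val`, not the exact Godbolt code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins for the question's globals; the real definitions are in the OP's code.
static int buckets[16];
static uint32_t hash_val(int v) { return static_cast<uint32_t>(v) * 2654435761u; }

// Counting in its own non-main function keeps the interesting code-gen
// separate from the timing/cout code, and avoids main's special treatment.
long count_hits(const std::vector<int>& lookups) {
    return std::count_if(lookups.begin(), lookups.end(), [](const auto& val) {
        return buckets[hash_val(val) % 16] == val;
    });
}
```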
- branchy code: gcc8.2 `-O2` or `-O3`, and `-O2`/`-O3` with `-march=haswell` or `broadwell`
- branchless code: gcc8.2 `-O2`/`-O3` with `-march=skylake`
That's weird. The branchless code it emits has the same cost on Broadwell vs. Skylake. I wondered if Skylake vs. Haswell was favouring branchless because of cheaper `cmov`. GCC's internal cost model isn't always in terms of x86 instructions when it's optimizing in the middle-end (in GIMPLE, an architecture-neutral representation), so it doesn't yet know what x86 instructions would actually be used for a branchless sequence. Maybe a conditional-select operation is involved, and gcc models it as more expensive on Haswell, where `cmov` is 2 uops? But I tested `-march=broadwell` and still got branchy code. Hopefully we can rule that out, assuming gcc's cost model knows that Broadwell (not Skylake) was the first Intel P6/SnB-family uarch to have single-uop `cmov`, `adc`, and `sbb` (3-input integer ops).
I don't know what else about gcc's Skylake tuning option makes it favour branchless code for this loop. Gather is efficient on Skylake, but gcc auto-vectorizes (with `vpgatherqd xmm`) even with `-march=haswell`, where it doesn't look like a win: gather is expensive there, and it requires 32x64 => 64-bit SIMD multiplies using 2x `vpmuludq` per input vector. Maybe worth it with SKL, but I doubt it for HSW. It's also probably a missed optimization not to pack back down to dword elements and gather twice as many elements per instruction, with nearly the same throughput, using `vpgatherdd`.
I did rule out the function being less optimized because it was called `main` (and marked `cold`). It's generally recommended not to put your microbenchmarks in `main`: compilers at least used to optimize `main` differently (e.g. for code-size instead of just speed).
Clang does make it branchless even with just `-O2`.
When compilers have to decide between branchy and branchless code, they use heuristics to guess which will be better. If they think the condition is highly predictable (e.g. probably mostly not-taken), that leans in favour of branchy.
In this case, the heuristic could have decided that out of all 2^32 possible values for an `int`, finding exactly the value you're looking for is rare. The `==` may have fooled gcc into thinking it would be predictable.
Branchy can be better sometimes, depending on the loop, because it can break a data dependency. See *gcc optimization flag -O3 makes code slower than -O2* for a case where the branch was very predictable, and the `-O3` branchless code-gen was slower.
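To illustrate the dependency point with the canonical example (my sketch, not code from the linked question): a conditional-select keeps the loop-carried chain through the accumulator on the critical path, while a correctly-predicted branch lets the update execute speculatively.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Branchless: a likely cmov puts the compare on the loop-carried
// dependency chain through `best`, so latency limits throughput.
int min_branchless(const std::vector<int>& v) {
    int best = INT_MAX;
    for (int x : v)
        best = x < best ? x : best;
    return best;
}

// Branchy: an almost-always not-taken (and thus well-predicted) branch
// keeps `best` off the critical path; the compares run ahead speculatively.
int min_branchy(const std::vector<int>& v) {
    int best = INT_MAX;
    for (int x : v)
        if (x < best)
            best = x;
    return best;
}
```

Both compute the same result; the difference only shows up in per-iteration latency vs. misprediction cost, which is exactly the trade-off the compiler's heuristics are guessing about.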
`-O3` at least used to be more aggressive at if-conversion of conditionals into branchless sequences like `cmp` / `lea 1(%rbx), %rcx` / `cmove %rcx, %rbx`, or in this case more likely `xor`-zero / `cmp` / `sete` / `add`. (Actually gcc `-march=skylake` uses `sete` / `movzx`, which is pretty much strictly worse.)
Without any runtime profiling / instrumentation data, these guesses can easily be wrong. Stuff like this is where Profile-Guided Optimization shines. Compile with `-fprofile-generate`, run it, then compile with `-fprofile-use`, and you'll probably get branchless code.
BTW, `-O3` is generally recommended these days. See *Is optimisation level -O3 dangerous in g++?*. It does not enable `-funroll-loops` by default, so it only bloats code when it auto-vectorizes (especially with a very large fully-unrolled scalar prologue/epilogue around a tiny SIMD loop that bottlenecks on loop overhead. /facepalm.)