Using a LUT of function pointers forces the compiler to use that strategy. It could in theory compile the switch version to essentially the same code as the LUT version (now that you've added out-of-bounds checks to both). In practice, that's not what gcc or clang choose to do, so it's worth looking at the asm output to see what happened.
(update: gcc -fpie
(on by default on most modern Linux distros) likes to make tables of relative offsets, instead of absolute function pointers, so the rodata is position-independent, too. GCC Jump Table initialization code generating movsxd and add?. This could be a missed-optimization, see my answer there for links to gcc bug reports. Manually creating an array of function pointers could work around that.)
I put the code on the Godbolt compiler explorer with both functions in one compilation unit (with gcc and clang output), to see how it actually compiled. I expanded the functions a bit so it wasn't just two cases.
void fsm_switch(int state) {
switch(state) {
case FUNC0: func0(); break;
case FUNC1: func1(); break;
case FUNC2: func2(); break;
case FUNC3: func3(); break;
default: ;// Error handling
}
//prevent_tailcall();
}
void fsm_lut(state_e state) {
if (likely(state < FUNC_COUNT)) // without likely(), gcc puts the LUT on the taken side of this branch
lookUpTable[state]();
else
;// Error handling
//prevent_tailcall();
}
See also
How do the likely() and unlikely() macros in the Linux kernel work and what is their benefit?
x86
On x86, clang makes its own LUT for the switch, but the entries are pointers to within the function, not the final function pointers. So for clang-3.7, the switch happens to compile to code that is strictly worse than the manually-implemented LUT. Either way, x86 CPUs tend to have branch prediction that can handle indirect calls / jumps, at least if they're easy to predict.
GCC uses a sequence of conditional branches (but unfortunately doesn't tail-call directly with conditional branches, which AFAICT is safe on x86. It checks 1, <1, 2, 3, in that order, with mostly not-taken branches until it finds a match.
They make essentially identical code for the LUT: bounds check, zero the upper 32-bit of the arg register with a mov
, and then a memory-indirect jump with an indexed addressing mode.
ARM:
gcc 4.8.2 with -mcpu=cortex-m4 -O2
makes interesting code.
As Olaf said, it makes an inline table of 1B entries. It doesn't jump directly to the target function, but instead to a normal jump instruction (like b func3
). This is a normal unconditional jump, since it's a tail-call.
Each table destination entry needs significantly more code (Godbolt) if fsm_switch
does anything after the call (like in this case a non-inline function call, if void prevent_tailcall(void);
is declared but not defined), or if this is inlined into a larger function.
@@ With void prevent_tailcall(void){} defined so it can inline:
@@ Unlike in the godbolt link, this is doing tailcalls.
fsm_switch:
cmp r0, #3 @ state,
bhi .L5 @
tbb [pc, r0] @ state
@@ There's no section .rodata directive here: the table is in-line with the code, so there's no need for base pointer to be loaded into a reg. And apparently it's even loaded from I-cache, not D-cache
.byte (.L7-.L8)/2
.byte (.L9-.L8)/2
.byte (.L10-.L8)/2
.byte (.L11-.L8)/2
.L11:
b func3 @ optimized tail-call
.L10:
b func2
.L9:
b func1
.L7:
b func0
.L5:
bx lr @ This is ARM's equivalent of an x86 ret insn
IDK if there's much difference between how well branch prediction works for tbb
vs. a full-on indirect jump or call (blx
), on a lightweight ARM core. A data access to load the table might be more significant than the two-step jump to a branch instruction you get with a switch
.
I've read that indirect branches are poorly predicted on ARM. I'd hope it's not bad if the indirect branch has the same target every time. But if not, I'd assume most ARM cores won't find even short patterns the way big x86 cores will.
Instruction fetch/decode takes longer on x86, so it's more important to avoid bubbles in the instruction stream. This is one reason why x86 CPUs have such good branch prediction. Modern branch predictors even do a good job with patterns for indirect branches, based on history of that branch and/or other branches leading up to it.
The LUT function has to spend a couple instructions loading the base address of the LUT into a register, but otherwise is pretty much like x86:
fsm_lut:
cmp r0, #3 @ state,
bhi .L13 @,
movw r3, #:lower16:.LANCHOR0 @ tmp112,
movt r3, #:upper16:.LANCHOR0 @ tmp112,
ldr r3, [r3, r0, lsl #2] @ tmp113, lookUpTable
bx r3 @ indirect register sibling call @ tmp113
.L13:
bx lr @
@ in the .rodata section
lookUpTable:
.word func0
.word func1
.word func2
.word func3
See Mike of SST's answer for a similar analysis on a Microchip dsPIC.