38

In another thread, I was told that a switch may be better than a lookup table in terms of speed and compactness.

So I'd like to understand the differences between this:

Lookup table

static void func1(){}
static void func2(){}

typedef enum
{
    FUNC1,
    FUNC2,
    FUNC_COUNT
} state_e;

typedef void (*func_t)(void);

const func_t lookUpTable[FUNC_COUNT] =
{
    [FUNC1] = &func1,
    [FUNC2] = &func2
};

void fsm(state_e state)
{
    if (state < FUNC_COUNT) 
        lookUpTable[state]();
    else
        ;// Error handling
}

and this:

Switch

static void func1(){}
static void func2(){}

void fsm(int state)
{
    switch(state)
    {
        case FUNC1: func1(); break;
        case FUNC2: func2(); break;
        default:    ;// Error handling
    }
}

I thought that a lookup table was faster since compilers try to transform switch statements into jump tables when possible. Since this may be wrong, I'd like to know why!

Thanks for your help!

Peter Cordes
Plouff
  • 14
    We can't tell you an answer to this, as it depends on too many things, but mostly the compiler you're using. Instead, you should instruct your compiler to output the assembly in both cases, while using optimization flags, and compare it yourself. – nos Mar 07 '16 at 08:05
  • 2
    You should have a look to the following post about switch statements: http://lazarenko.me/switch/ – Guillaume George Mar 07 '16 at 08:07
  • @nos: That's my point actually. I mean I thought that no matter the compiler, the switch would be slower! In my initial question, I forgot to say that my state variable has contiguous values (it's an enum). I updated my question to add that. – Plouff Mar 07 '16 at 08:40
  • @GuillaumeGeorge: Thanks for the link. Once I've read it, I hope it will give me a new perspective on the problem! – Plouff Mar 07 '16 at 08:42
  • 3
    Some compilers transform simple switch-statements to lookup tables. A general answer is really not possible. – Sebastian Mach Mar 07 '16 at 14:33
  • @phresnel: I couldn't phrase it like that at the beginning, but I had things like hardware mechanisms in mind. The answers below helped me to understand why there is no rule of thumb. But again, I couldn't really phrase it like that! – Plouff Mar 07 '16 at 17:03
  • @phresnel: This has been common practice in compilers for more than 20 years. Still, there are differences, e.g. where the table is placed and how it is accessed (instruction or data fetch). These are very important differences for embedded systems (see my answer). It is much different from PC-like systems (which include most ARM Cortex-A based systems). – too honest for this site Mar 07 '16 at 17:53
  • 1
    @Olaf: Not sure why you are teaching me that this is practice since >20 years, especially since it's not universally true and does not happen for every switch and with all optimization flags. It also depends on cost/benefit-heuristics. Not every LUT is an optimization, likewise, not every hardcoded if-else structure is. – Sebastian Mach Mar 08 '16 at 11:51
  • @phresnel: I did not say it is done for every `switch`; of course it depends on the labels. Maybe I should have used "common practice" or "state of the art", but that's what it is. Anyway, if your compiler does not do it for a simple `switch` with increasing (by one) labels, you should get a modern one. There are still expensive rubbish compilers, notably in the embedded field. Anyway, that was not the point of my comment! – too honest for this site Mar 08 '16 at 14:08
  • @Olaf: Though other optimizations exist. I am not sure about sparse lookup tables, but binary trees have been seen. For example http://programming.sirrida.de/hashsuper.pdf (you've probably seen that). Anyways, getting a tad too long for comments :) – Sebastian Mach Mar 08 '16 at 14:36
  • @phresnel: ... and still missing the point. – too honest for this site Mar 08 '16 at 14:37
  • 2
    A lot of compilers can't inline a function pointer call (or may require multiple implementation specific options) and thus miss any optimizations that would go along with inlining... just something to keep in mind. – technosaurus Mar 10 '16 at 05:21
  • Switches, or generally executable code over lookup tables, should be faster. The paper behind the re2c scanner generator presents some empirical research on that http://re2c.org/_downloads/1994_bumbulis_cowan_re2c_a_more_versatile_scanner_generator.pdf – PSkocik Oct 17 '16 at 09:42

6 Answers

23

As I was the original author of the comment, I have to add a very important issue you did not mention in your question. That is, the original was about an embedded system. Presuming this is a typical bare-metal system with integrated Flash, there are very important differences from a PC on which I will concentrate.

Such embedded systems typically have the following constraints.

  • no CPU cache.
  • Flash requires waitstates for higher (i.e. >ca. 32MHz) CPU clocks. The actual ratio depends on the die design, low power/high speed process, operating voltage, etc.
  • To hide waitstates, Flash has wider read-lines than the CPU-bus.
  • This only works well for linear code with instruction prefetch.
  • Data accesses disturb instruction prefetch or are stalled until it finished.
  • Flash might have an internal very small instruction cache.
  • If any at all, there is an even smaller data-cache.
  • The small caches result in more frequent thrashing (replacing a previous entry before it has been used another time).

On the STM32F4xx, for example, a read takes 6 clocks at 150MHz/3.3V for 128 bits (4 words). So if a data-access is required, chances are good it adds more than 12 clocks delay for all data to be fetched (there are additional cycles involved).

Presuming compact state-codes, for the actual problem, this has the following effects on this architecture (Cortex-M4):

  • Lookup-table: Reading the function address is a data-access. With all implications mentioned above.
  • A switch, on the other hand, uses a special "table-lookup" instruction which uses code-space data right behind the instruction. So the first entries are possibly already prefetched. Other entries don't break the prefetch. Also the access is a code-access, thus the data goes into the Flash's instruction cache.

Also note that the switch does not need functions, thus the compiler can fully optimise the code. This is not possible for a lookup table. At least code for function entry/exit is not required.
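To make the point about not needing functions concrete, here is a minimal sketch of a state machine whose state bodies live directly inside the `switch`. The state names and `fsm_step` are hypothetical and not taken from the question; whether this actually ends up faster still depends on the compiler and target.

```c
/* Hypothetical three-state machine; all names are illustrative only. */
typedef enum { ST_IDLE, ST_RUN, ST_DONE } fsm_state_e;

/* Every case body is ordinary code inside one function, so the compiler
 * can optimise across cases and no call/return sequence is required. */
fsm_state_e fsm_step(fsm_state_e state, int *counter)
{
    switch (state) {
    case ST_IDLE:
        *counter = 0;           /* reset on start */
        return ST_RUN;
    case ST_RUN:
        ++*counter;             /* do one unit of work */
        return (*counter >= 3) ? ST_DONE : ST_RUN;
    case ST_DONE:
        return ST_DONE;         /* stay finished */
    default:
        return ST_IDLE;         /* error handling: restart */
    }
}
```

Calling `fsm_step` repeatedly from `ST_IDLE` walks through `ST_RUN` three times and then settles in `ST_DONE`.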


Due to the aforementioned and other factors, it is hard to give an estimate. It heavily depends on your platform and the code structure. But assuming the system given above, the switch is very likely faster (and clearer, btw.).

too honest for this site
  • Your answer is indeed more relevant since I was talking about embedded software. My target is not an STM32, but it is an MCU. It takes 8 cycles to complete a read from the Flash. Unfortunately, I would prefer to keep using functions even if I reworked the code into a `switch`. So I must also consider the readability of the solution. Apart from the efficiency perspective, I tend to prefer the readability of the lookup table. But this is a matter of taste! Thanks for your very detailed answer (I gave you the answer mark :)! – Plouff Mar 07 '16 at 15:30
  • Subsidiary question: this kind of behavior is not directly related to the assembly right? I mean, is instrumenting the code (like toggling GPIOs) the only solution to see the effects of those hw mechanism? Thanks! – Plouff Mar 07 '16 at 15:34
  • 2
    @Plouff: I'm not sure what you mean with the last comment. It certainly is an assembly/implementation detail, as always when it comes to performance. I hope I made clear there is a bunch of factors involved. About using functions: If you use a modern compiler (e.g. gcc), declare the functions `static`, it may very well inline them into a `switch` (depends on optimisation settings, too). Adding `inline`, **can** give the compiler an even stronger hint (but not necessarily). Not sure how more "conservative" compilers like IAR behave (they sometimes tend to optimise such constructs worse). – too honest for this site Mar 07 '16 at 15:41
  • I mean that I don't understand how I could see the effect of the waitstates of the flash in the assembly. Is it a correct assumption? I read stuff about `inline` functions in the past. But I remember that it does not imply that the compiler will actually inline the function. Moreover, `inline` is a C99 feature. For all these reasons, I don't use it much... It may be time to give it a try. Thanks for the hint! (@name feature broken?!) – Plouff Mar 07 '16 at 16:59
  • `inline` is a standard C feature since C99, current version is C11. There is only **one** C standard, so when talking about C, it is C11. C99 is downwards compatible; the differences don't matter here. You should not use an outdated compiler which does not support at least C99. It was superseded by C11 five years ago, and it is 17 years since its release! Please read my comment carefully. I wrote about modern compilers, not some rubbish (yet expensive) proprietary toolchain. If you stated which MCU you use, I might have been able to provide some more hints. – too honest for this site Mar 07 '16 at 17:45
  • At the moment I am on TI C2000 targets. But, without naming them, it happens that I must work with older MCUs whose toolchain only supports C89!! (I think I can't @name you since you are the only one commenting here!) – Plouff Mar 08 '16 at 07:55
  • @Plouff: I remember from an university lab back in the 90ies where we used a TI MCU (not sure which one) where we actually saw such optimisations. Just checked. The C2000 Flash actually does have waitstates. Not sure about caches, etc, though. I'd benchmark if**f** that is really an issue. – too honest for this site Mar 08 '16 at 22:54
  • Yes the C2000 has wait states (8 in my application). Also, they don't have cache but a prefetch buffer. Anyway, like you said there is no issue *at the moment*, so I don't need to benchmark *at the moment*. but we plan to add more and more features so it may be a problem later. And I prefer to anticipate those problems with potential solutions. Thanks again – Plouff Mar 09 '16 at 07:57
17

First, on some processors, indirect calls (e.g. through a pointer) - like those in your Lookup Table example - are costly (pipeline breakage, TLB, cache effects). It might also be true for indirect jumps...

Then, a good optimizing compiler might inline the call to func1() in your Switch example; then you won't run any prologue or epilogue for an inlined function.

You need to benchmark to be sure, since a lot of other factors matter on the performance. See also this (and the reference there).
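As a starting point for such a benchmark, here is a rough sketch in standard C. The function bodies (incrementing a `volatile` counter so the calls cannot be optimised away), the alternating input pattern, and the names `bench_lut` / `bench_switch` are assumptions made for illustration. `clock()` may be unavailable or too coarse on a bare-metal MCU, where a hardware cycle counter or a toggled GPIO would be the usual substitute.

```c
#include <time.h>

typedef enum { FUNC1, FUNC2, FUNC_COUNT } state_e;
typedef void (*func_t)(void);

/* volatile counters keep the compiler from deleting the work */
static volatile unsigned long hits[FUNC_COUNT];
static void func1(void) { hits[FUNC1]++; }
static void func2(void) { hits[FUNC2]++; }

static const func_t lookUpTable[FUNC_COUNT] = {
    [FUNC1] = func1,
    [FUNC2] = func2
};

static void fsm_lut(state_e state)
{
    if (state < FUNC_COUNT)
        lookUpTable[state]();
}

static void fsm_switch(int state)
{
    switch (state) {
    case FUNC1: func1(); break;
    case FUNC2: func2(); break;
    default:    break;          /* error handling */
    }
}

/* Time n dispatches through the lookup table; returns CPU seconds. */
double bench_lut(unsigned long n)
{
    clock_t start = clock();
    for (unsigned long i = 0; i < n; i++)
        fsm_lut((state_e)(i & 1u));     /* alternate FUNC1, FUNC2 */
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Same loop, dispatching through the switch instead. */
double bench_switch(unsigned long n)
{
    clock_t start = clock();
    for (unsigned long i = 0; i < n; i++)
        fsm_switch((int)(i & 1u));
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Run both with the same large `n` at the optimisation level you ship with, and repeat the measurement a few times. A strictly alternating pattern is easy for a branch predictor, so real state sequences may behave differently.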

Community
  • 1
  • 1
Basile Starynkevitch
  • Thanks, exactly the answer I was looking for! I had no idea that indirect calls would lead to stuff like pipeline breakage, TLB and cache effects. But now, I need to figure out what it means... I had one question at the beginning and now I have 3 more! Thanks ;)! – Plouff Mar 07 '16 at 08:45
  • @BarryTheHatchet: I am talking about this one: http://stackoverflow.com/q/35797254/882697 . I don't think you commented here. Which thread are you talking about? It might be interesting for me :). – Plouff Mar 07 '16 at 12:47
  • @Plouff Okay must have been another post then. Co-incidence! – Lightness Races in Orbit Mar 07 '16 at 12:51
4

Using a LUT of function pointers forces the compiler to use that strategy. It could in theory compile the switch version to essentially the same code as the LUT version (now that you've added out-of-bounds checks to both). In practice, that's not what gcc or clang choose to do, so it's worth looking at the asm output to see what happened.

(update: gcc -fpie (on by default on most modern Linux distros) likes to make tables of relative offsets, instead of absolute function pointers, so the rodata is position-independent, too; see "GCC Jump Table initialization code generating movsxd and add?". This could be a missed optimization; see my answer there for links to gcc bug reports. Manually creating an array of function pointers could work around that.)


I put the code on the Godbolt compiler explorer with both functions in one compilation unit (with gcc and clang output), to see how it actually compiled. I expanded the functions a bit so it wasn't just two cases.

void fsm_switch(int state) {
    switch(state) {
        case FUNC0: func0(); break;
        case FUNC1: func1(); break;
        case FUNC2: func2(); break;
        case FUNC3: func3(); break;
        default:    ;// Error handling
    }
    //prevent_tailcall();
}

void fsm_lut(state_e state) {
    if (likely(state < FUNC_COUNT))  // without likely(), gcc puts the LUT on the taken side of this branch
        lookUpTable[state]();
    else
        ;// Error handling
    //prevent_tailcall();
}

See also How do the likely() and unlikely() macros in the Linux kernel work and what is their benefit?
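The two functions above rely on definitions that are not shown. A minimal sketch of what they might look like follows, assuming GCC or clang (`__builtin_expect` is a compiler extension, not standard C). The function bodies that record `last_called` are stand-ins added here so the snippet can run; when comparing asm output you would more likely leave `func0`..`func3` as declarations only, so they cannot be inlined.

```c
/* Branch hints in the style of the Linux kernel macros (GCC/clang only). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

typedef enum { FUNC0, FUNC1, FUNC2, FUNC3, FUNC_COUNT } state_e;
typedef void (*func_t)(void);

/* Stand-in bodies (assumption): remember which function ran last. */
static volatile int last_called = -1;
static void func0(void) { last_called = 0; }
static void func1(void) { last_called = 1; }
static void func2(void) { last_called = 2; }
static void func3(void) { last_called = 3; }

static const func_t lookUpTable[FUNC_COUNT] = {
    [FUNC0] = func0,
    [FUNC1] = func1,
    [FUNC2] = func2,
    [FUNC3] = func3
};

void fsm_lut(state_e state)
{
    /* Hint that the in-range path is the common one. */
    if (likely(state < FUNC_COUNT))
        lookUpTable[state]();
    /* else: error handling */
}
```

The hint only influences code layout and static prediction; it does not change which branch is taken at run time.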


x86

On x86, clang makes its own LUT for the switch, but the entries are pointers to within the function, not the final function pointers. So for clang-3.7, the switch happens to compile to code that is strictly worse than the manually-implemented LUT. Either way, x86 CPUs tend to have branch prediction that can handle indirect calls / jumps, at least if they're easy to predict.

GCC uses a sequence of conditional branches (but unfortunately doesn't tail-call directly with conditional branches, which AFAICT is safe on x86). It checks 1, <1, 2, 3, in that order, with mostly not-taken branches until it finds a match.

They make essentially identical code for the LUT: bounds check, zero the upper 32 bits of the arg register with a mov, and then a memory-indirect jump with an indexed addressing mode.


ARM:

gcc 4.8.2 with -mcpu=cortex-m4 -O2 makes interesting code.

As Olaf said, it makes an inline table of 1B entries. It doesn't jump directly to the target function, but instead to a normal jump instruction (like b func3). This is a normal unconditional jump, since it's a tail-call.

Each table destination entry needs significantly more code (Godbolt) if fsm_switch does anything after the call (like in this case a non-inline function call, if void prevent_tailcall(void); is declared but not defined), or if this is inlined into a larger function.

@@ With   void prevent_tailcall(void){} defined so it can inline:
@@ Unlike in the godbolt link, this is doing tailcalls.
fsm_switch:
        cmp     r0, #3    @ state,
        bhi     .L5       @
        tbb     [pc, r0]  @ state
       @@ There's no section .rodata directive here: the table is in-line with the code, so there's no need for base pointer to be loaded into a reg.  And apparently it's even loaded from I-cache, not D-cache
        .byte   (.L7-.L8)/2
        .byte   (.L9-.L8)/2
        .byte   (.L10-.L8)/2
        .byte   (.L11-.L8)/2
.L11:
        b       func3     @ optimized tail-call
.L10:
        b       func2
.L9:
        b       func1
.L7:
        b       func0
.L5:
        bx      lr         @ This is ARM's equivalent of an x86 ret insn

IDK if there's much difference between how well branch prediction works for tbb vs. a full-on indirect jump or call (blx), on a lightweight ARM core. A data access to load the table might be more significant than the two-step jump to a branch instruction you get with a switch.

I've read that indirect branches are poorly predicted on ARM. I'd hope it's not bad if the indirect branch has the same target every time. But if not, I'd assume most ARM cores won't find even short patterns the way big x86 cores will.

Instruction fetch/decode takes longer on x86, so it's more important to avoid bubbles in the instruction stream. This is one reason why x86 CPUs have such good branch prediction. Modern branch predictors even do a good job with patterns for indirect branches, based on history of that branch and/or other branches leading up to it.

The LUT function has to spend a couple instructions loading the base address of the LUT into a register, but otherwise is pretty much like x86:

fsm_lut:
        cmp     r0, #3    @ state,
        bhi     .L13      @,
        movw    r3, #:lower16:.LANCHOR0 @ tmp112,
        movt    r3, #:upper16:.LANCHOR0 @ tmp112,
        ldr     r3, [r3, r0, lsl #2]      @ tmp113, lookUpTable
        bx      r3  @ indirect register sibling call    @ tmp113
.L13:
        bx      lr  @

@ in the .rodata section
lookUpTable:
        .word   func0
        .word   func1
        .word   func2
        .word   func3

See Mike of SST's answer for a similar analysis on a Microchip dsPIC.

Peter Cordes
  • 1
    That's another great answer, thank you so much! I wondered if everything could be seen in the assembly. Now I know that you have to know how your hardware will handle the assembly you feed it! So you answered another question I had: the answer to this question depends not only on the compiler you use but also on the hardware. So the only fully effective solution is benchmarking and not code analysis (unless you are absolutely aware of all the hw mechanisms). Thanks again! – Plouff Mar 15 '16 at 08:05
  • 1
    I'd like to +1 you again for the discovery of likely(), and tail call effects! – Plouff Mar 15 '16 at 08:06
  • @Plouff: actually, looking at the asm and being aware of what's slow across a range of hardware families can sort of substitute for benchmarks. Most people don't have one of every x86 uarch lying around in a benchmark farm. Although for embedded, you can of course just benchmark on your target platform. – Peter Cordes Mar 16 '16 at 00:10
  • Yes, that is what I tried to say :). In my case, benchmarking will be the only solution unless I take a lot of time studying the architecture of my target! – Plouff Mar 16 '16 at 07:40
  • 1
    On many platforms based on a slow flash code store, manual use of a table will allow one control over whether the table is stored in fast RAM or slow flash. I don't think I've ever seen a compiler with options to control switch-statement code generation in such fashion. – supercat Sep 07 '18 at 19:12
3

msc's answer and the comments give you good hints as to why performance may not be what you expect. Benchmarking is the rule, but results will vary from one architecture to another, and may change with other versions of the compiler and of course its configuration and options selected.

Note however that your 2 pieces of code do not perform the same validation on state:

  • The switch will gracefully do nothing if state is not one of the defined values,
  • The jump table version will invoke undefined behavior for all but the 2 values FUNC1 and FUNC2.

There is no generic way to initialize the jump table with dummy function pointers without making assumptions on FUNC_COUNT. To get the same behavior, the jump table version should look like this:

void fsm(int state) {
    if (state >= 0 && state < FUNC_COUNT && lookUpTable[state] != NULL)
        lookUpTable[state]();
}

Try benchmarking this and inspect the assembly code. Here is a handy online compiler for this: http://gcc.godbolt.org/#
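For reference, a self-contained sketch of the checked dispatch, with one deliberately unpopulated table slot to show why the NULL test matters. `FUNC_RESERVED`, the `calls` counter, the name `fsm_checked` and the `int` return value are hypothetical additions for illustration; entries omitted from a designated initializer are zero-initialized, which for function pointers means null.

```c
#include <stddef.h>  /* NULL */

/* FUNC_RESERVED is a hypothetical state that has no handler. */
typedef enum { FUNC1, FUNC2, FUNC_RESERVED, FUNC_COUNT } state_e;
typedef void (*func_t)(void);

static int calls;                       /* illustration only */
static void func1(void) { calls += 1; }
static void func2(void) { calls += 10; }

/* The FUNC_RESERVED slot is not listed, so it is a null pointer. */
static const func_t lookUpTable[FUNC_COUNT] = {
    [FUNC1] = func1,
    [FUNC2] = func2
};

/* Returns 1 if a handler was called, 0 if the state was rejected. */
int fsm_checked(int state)
{
    if (state >= 0 && state < FUNC_COUNT && lookUpTable[state] != NULL) {
        lookUpTable[state]();
        return 1;
    }
    return 0;                           /* error handling goes here */
}
```

Taking `int` rather than the enum type makes the negative check meaningful regardless of which underlying type the compiler picks for the enum.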

chqrlie
  • In this question, I wanted to focus on the lookup table vs the switch. But you are right: the state validation has its cost too! I added some details, to better reflect the implementation I use. So far this is interesting to notice that in the 3 answers I had there is always the same advice: benchmark! Thanks for the online compiler this will be a very useful link!! – Plouff Mar 07 '16 at 08:32
  • 1
    In a proper implementation (i.e. including catching invalid states and a compact switch), both will result in quite a similar construct. Details are more in the hardware architecture, how data is read from Flash, access types, etc. – too honest for this site Mar 07 '16 at 13:55
3

On the Microchip dsPIC family of devices a look-up table is stored as a set of instruction addresses in the Flash itself. Performing the look-up involves reading the address from the Flash then calling the routine. Making the call adds another handful of cycles to push the instruction pointer and other bits and bobs (e.g. setting the stack frame) of housekeeping.

For example, on the dsPIC33E512MU810, using XC16 (v1.24) the look-up code:

lookUpTable[state]();

Compiles to (from the disassembly window in MPLAB-X):

!        lookUpTable[state]();
0x2D20: MOV [W14], W4    ; get state from stack-frame (not counted)
0x2D22: ADD W4, W4, W5   ; 1 cycle (addresses are 16 bit aligned)
0x2D24: MOV #0xA238, W4  ; 1 cycle (get base address of look-up table)
0x2D26: ADD W5, W4, W4   ; 1 cycle (get address of entry in table)
0x2D28: MOV [W4], W4     ; 1 cycle (get address of the function)
0x2D2A: CALL W4          ; 2 cycles (push PC+2 set PC=W4)

... and each (empty, do-nothing) function compiles to:

!static void func1()
!{}
0x2D0A: LNK #0x0         ; 1 cycle (set up stack frame)
! Function body goes here
0x2D0C: ULNK             ; 1 cycle (un-link frame pointer)
0x2D0E: RETURN           ; 3 cycles

This is a total of 11 instruction cycles of overhead for any of the cases, and they all take the same. (Note: If either the table or the functions it contains are not in the same 32K program word Flash page, there will be an even greater overhead due to having to get the Address Generation Unit to read from the correct page, or to set up the PC to make a long call.)

On the other hand, providing that the whole switch statement fits within a certain size, the compiler will generate code that does a test and relative branch as two instructions per case taking three (or possibly four) cycles per case up to the one that's true.

For example, the switch statement:

switch(state)
{
case FUNC1: state++; break;
case FUNC2: state--; break;
default: break;
}

Compiles to:

!    switch(state)
0x2D2C: MOV [W14], W4       ; get state from stack-frame (not counted)
0x2D2E: SUB W4, #0x0, [W15] ; 1 cycle (compare with first case)
0x2D30: BRA Z, 0x2D38       ; 1 cycle (if branch not taken, or 2 if it is)
0x2D32: SUB W4, #0x1, [W15] ; 1 cycle (compare with second case)
0x2D34: BRA Z, 0x2D3C       ; 1 cycle (if branch not taken, or 2 if it is)
!    {
!    case FUNC1: state++; break;
0x2D38: INC [W14], [W14]    ; To stop the switch being optimised out
0x2D3A: BRA 0x2D40          ; 2 cycles (go to end of switch)
!    case FUNC2: state--; break;
0x2D3C: DEC [W14], [W14]    ; To stop the switch being optimised out
0x2D3E: NOP                 ; compiler did a fall-through (for some reason)
!    default: break;
0x2D36: BRA 0x2D40          ; 2 cycles (go to end of switch)
!    }

This is an overhead of 5 cycles if the first case is taken, 7 if the second case is taken, etc., meaning they break even on the fourth case.

This means that knowing your data at design time will have a significant influence on the long-term speed. If you have a significant number (more than about 4 cases) and they all occur with similar frequency then a look-up table will be quicker in the long run. If the frequency of the cases is significantly different (e.g. case 1 is more likely than case 2, which is more likely than case 3, etc.) then, if you order the switch with the most likely case first, then the switch will be faster in the long run. For the edge case when you only have a few cases the switch will (probably) be faster anyway for most executions and is more readable and less error prone.

If there are only a few cases in the switch, or some cases will occur more often than others, then doing the test and branch of the switch will probably take fewer cycles than using a look-up table. On the other hand, if you have more than a handful of cases that occur with similar frequency then the look-up will probably end up being faster on average.

Tip: Go with the switch unless you know the look-up will definitely be faster and the time it takes to run is important.

Edit: My switch example is a little unfair, as I've ignored the original question and in-lined the 'body' of the cases to highlight the real advantage of using a switch over a look-up. If the switch has to do the call as well then it only has the advantage for the first case!
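The trade-off can be put into a small cost model. This is only a sketch built on the cycle counts from the listings above: it assumes the 5, 7, ... pattern keeps growing by two cycles per extra case (an extrapolation from the two-case listing), uses the inlined-body switch without any call overhead, and the function names are made up for illustration.

```c
/* Look-up overhead from the listings: 6 cycles to fetch and call,
 * plus 5 cycles of LNK/ULNK/RETURN in the called function. */
enum { LUT_CYCLES = 11 };

/* Switch overhead when the k-th case (1-based) matches:
 *   (k - 1) not-taken compare/branch pairs, 1 + 1 cycles each,
 *   one taken compare/branch pair,          1 + 2 cycles,
 *   the branch to the end of the switch,    2 cycles.
 * This gives 5, 7, 9, 11, ... */
int switch_cycles(int k)
{
    return 2 * (k - 1) + 3 + 2;
}

/* Expected switch overhead for n cases, where p[k-1] is the
 * probability that case k is the one that matches. */
double expected_switch_cycles(const double *p, int n)
{
    double sum = 0.0;
    for (int k = 1; k <= n; k++)
        sum += p[k - 1] * switch_cycles(k);
    return sum;
}
```

Under these assumptions, a switch whose first case matches 70% of the time averages well below the 11-cycle look-up, while 20 equally likely cases average about 24 cycles, so the look-up wins there.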

Evil Dog Pie
  • Thank you for this case study. At the moment I have about 20 states (i.e. cases). It might be more in the future. The only issue I see with your answer is that the switch is not translated into a jump table by the compiler. That would be nice. Moreover, like you said, you don't have the subroutine call overhead in the `switch`, but the examples give a good idea anyway. Thanks again! – Plouff Mar 08 '16 at 07:42
  • 1
    @Plouff You're correct, the Microchip compiler does not translate switch statements into a jump table, simply because the two-instruction test and branch sequence is more efficient. This gives the developer a choice of solutions based on their requirements. – Evil Dog Pie Mar 08 '16 at 13:16
  • @Plouff I chose to ignore the calls in the cases because your example code implies that the use case is a state machine. For these, I would usually try to keep the state transition handling inline in the switch statement cases (so that the state management is encapsulated) and delegate specific state transition handling to other functions as necessary. This also allows the code to take advantage of the (slim) performance gains of the switch over the jump table where possible. That's all very subjective though and I've no doubt your project has different requirements from mine. :-) – Evil Dog Pie Mar 08 '16 at 13:20
  • Thanks for the details. Would you say that about 20 states is a lot of states? – Plouff Mar 08 '16 at 14:15
  • 1
    Generally, yes. But if one or two states occur significantly more often than others, I'd expect the switch to be a little more efficient. However, the choice between jump table and switch depends largely on the compiler and your target processor. You can usually get the compiler to output an 'intermediate' file that contains the compiled assembly language instructions. It would be worth having a look at that and comparing the two. But unless you know for certain that it needs optimisation your time is probably better spent on other things, as the difference between the two is very small. – Evil Dog Pie Mar 09 '16 at 10:06
  • Ok, thanks for your feedback :). Like I said somewhere, I don't really need this kind of optimization now. But I may need it in the future. – Plouff Mar 09 '16 at 10:52
2

To have even more compiler outputs, here is what is produced by the TI C28x compiler using @PeterCordes' sample code:

_fsm_switch:
        CMPB      AL,#0                 ; [CPU_] |62| 
        BF        $C$L3,EQ              ; [CPU_] |62| 
        ; branchcc occurs ; [] |62| 
        CMPB      AL,#1                 ; [CPU_] |62| 
        BF        $C$L2,EQ              ; [CPU_] |62| 
        ; branchcc occurs ; [] |62| 
        CMPB      AL,#2                 ; [CPU_] |62| 
        BF        $C$L1,EQ              ; [CPU_] |62| 
        ; branchcc occurs ; [] |62| 
        CMPB      AL,#3                 ; [CPU_] |62| 
        BF        $C$L4,NEQ             ; [CPU_] |62| 
        ; branchcc occurs ; [] |62| 
        LCR       #_func3               ; [CPU_] |66| 
        ; call occurs [#_func3] ; [] |66| 
        B         $C$L4,UNC             ; [CPU_] |66| 
        ; branch occurs ; [] |66| 
$C$L1:    
        LCR       #_func2               ; [CPU_] |65| 
        ; call occurs [#_func2] ; [] |65| 
        B         $C$L4,UNC             ; [CPU_] |65| 
        ; branch occurs ; [] |65| 
$C$L2:    
        LCR       #_func1               ; [CPU_] |64| 
        ; call occurs [#_func1] ; [] |64| 
        B         $C$L4,UNC             ; [CPU_] |64| 
        ; branch occurs ; [] |64| 
$C$L3:    
        LCR       #_func0               ; [CPU_] |63| 
        ; call occurs [#_func0] ; [] |63| 
$C$L4:    
        LCR       #_prevent_tailcall    ; [CPU_] |69| 
        ; call occurs [#_prevent_tailcall] ; [] |69| 
        LRETR     ; [CPU_] 
        ; return occurs ; [] 



_fsm_lut:
;* AL    assigned to _state
        CMPB      AL,#4                 ; [CPU_] |84| 
        BF        $C$L5,HIS             ; [CPU_] |84| 
        ; branchcc occurs ; [] |84| 
        CLRC      SXM                   ; [CPU_] 
        MOVL      XAR4,#_lookUpTable    ; [CPU_U] |85| 
        MOV       ACC,AL << 1           ; [CPU_] |85| 
        ADDL      XAR4,ACC              ; [CPU_] |85| 
        MOVL      XAR7,*+XAR4[0]        ; [CPU_] |85| 
        LCR       *XAR7                 ; [CPU_] |85| 
        ; call occurs [XAR7] ; [] |85| 
$C$L5:    
        LCR       #_prevent_tailcall    ; [CPU_] |88| 
        ; call occurs [#_prevent_tailcall] ; [] |88| 
        LRETR     ; [CPU_] 
        ; return occurs ; [] 

I also used -O2 optimizations. We can see that the switch is not converted into a jump table even if the compiler has the ability.

Plouff