
I have an embedded application with a time-critical ISR that needs to iterate through an array of size 256 (preferably 1024, but 256 is the minimum) and check whether a value matches the array's contents. A bool will be set to true if this is the case.

The microcontroller is an NXP LPC4357 with an ARM Cortex-M4 core, and the compiler is GCC. I have already combined optimisation level 2 (level 3 is slower) with placing the function in RAM instead of flash. I also use pointer arithmetic and a for loop that counts down instead of up (checking if i != 0 is faster than checking if i < 256). All in all, I end up with a duration of 12.5 µs, which has to be reduced drastically to be feasible. This is the (pseudo) code I use now:

uint32_t i;
uint32_t *array_ptr = &theArray[0];
uint32_t compareVal = 0x1234ABCD;
bool validFlag = false;

for (i=256; i!=0; i--)
{
    if (compareVal == *array_ptr++)
    {
        validFlag = true;
        break;
    }
}

What would be the absolute fastest way to do this? Using inline assembly is allowed. Other 'less elegant' tricks are also allowed.

wlamers
  • You will definitely get a faster solution writing it in assembly language. You can gain speed with 3 ways: loop unrolling, cache prefetch and using "load-multiple" instructions. The first 2 could potentially be done in C, but not the last. I never trust C compilers to do the "right" thing and I'm rarely surprised. – BitBank Sep 04 '14 at 09:39
  • Is there any way to store the value in the array differently? If you can have them sorted, a binary search will surely be faster. If data to be stored and searched are within a certain range, they might be representable with a bit map, etc. – Remo.D Sep 04 '14 at 09:55
  • @BitBank: you'd be surprised how much compilers have improved in the last three decades. ARM especially is quite compiler-friendly. And I know for a fact that ARM on GCC can issue load-multiple instructions (since 2009 at least) – MSalters Sep 04 '14 at 11:44
  • Did your compiler unroll the loop? If not, have you tried doing that manually and measuring it? – Useless Sep 04 '14 at 14:03
  • Thanks all. No, the compiler does not unroll. In fact I solved the issue using a binary search (values in the array sorted beforehand). – wlamers Sep 04 '14 at 14:19
  • An important piece of information not expressed in the question is whether the array is under software control or generated by hardware (DMA, etc). The obvious optimization is to change the array if this is under your control. – artless noise Sep 04 '14 at 15:22
  • awesome question, people forget there are real world cases where performance matters. too many times questions like this are answered with "just use stl" – Kik Sep 04 '14 at 15:49
  • The title "... iterate through an array" is misleading since indeed you are simply searching for a given value. To iterate over an array implies something is to be done on each entry. Sorting, if the cost can be amortized over many searches, is indeed an efficient approach independent of the language implementation issues. – hardmath Sep 04 '14 at 15:51
  • A binary search over a sorted array is likely to be much cheaper than any linear scan of 256 entries. – Ira Baxter Sep 04 '14 at 19:10
  • Are you sure that you cannot simply use a binary search or a hash table? A binary search for 256 items == 8 comparisons. A hash table == 1 jump on average (or 1 jump *max* if you have a perfect hash). You should resort to assembly optimization only after you 1) have a decent searching algorithm (`O(1)` or `O(logN)`, compared to `O(N)`), and 2) you have profiled it to be the bottleneck. – Groo Sep 04 '14 at 21:02
  • Note: Before spending the energy to optimize this, do performance analysis to find out what percentage of the actual runtime it's consuming. Infinite speedup of 1% of the execution time takes infinite effort and yields a 1% net improvement. 10% speedup of 10% of the program is a lot easier to achieve and yields the same benefit. Initially, changing high-level algorithms will generally give you better bang for the buck than trying to tweak individual instructions, or even individual subroutines. – keshlam Sep 04 '14 at 21:50
  • @keshlam: A lot will depend upon the nature of the data being examined. If the sequence of data in the array is semantically significant, and if the data changes frequently, trying to maintain parallel data structures to optimize searching may be counterproductive. – supercat Sep 04 '14 at 23:11
  • @supercat: That's what I just said: Don't optimize blind. Understand the data, understand how the data is being used, understand how much use of this data actually affects your performance, THEN decide whether this is what you want to spend your time optimizing -- and be sure to measure before/during/after both to focus correctly and to decide whether your change is actually an improvement. Especially in Java; JIT of large applications is nondeterministic! – keshlam Sep 05 '14 at 00:09
  • Naive answer: what about the asm equivalent of `if (compareVal == array[0]) return true; if(compareVal == array[1]) return true; etc...` ? – ignis Sep 05 '14 at 05:59
  • How do I inline assembler in pseudo code? – Thomas Weller Sep 05 '14 at 15:09
  • @keshlam: He did that. This array search is all the ISR does, it takes 12.5 µs, and needs to be faster. I agree that in Java, you've got things like GC preventing deterministic performance, but here, you really do have this amount of control and sometimes it really is worth diving into it. It's fun, too...I work in Python now, but I miss this stuff :) – Vanessa Phipps Sep 05 '14 at 17:46
  • You have posted a classical [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) where X is "must be faster" and Y is "therefore I need assembler". There are plenty of answers that correctly point out that you've got an algorithmically over-complicated problem. But you got stuck on the Y-problem (assembly) ignoring algorithmic improvements. The next person who has to maintain your unnecessary assembly will curse your name. Assembler isn't as macho as it once was. Correct, readable, maintainable are the new studly. – msw Sep 07 '14 at 05:26
  • @Kik: Nobody forgets that performance matters. Unfortunately, everybody seems to have forgotten that _evidence_ matters. None of the answers below present measurements; NONE of them! – Lightness Races in Orbit Sep 07 '14 at 12:23
  • *people forget there are real world cases where performance matters* -- No they don't. *too many times questions like this are answered with "just use stl"* -- STL is highly efficient but it's not a valid answer here because this is a C question. – Jim Balter Oct 01 '14 at 06:53
  • What do the acronyms ISR and MCU stand for? – Burhan Ali Jan 21 '15 at 13:55

15 Answers


In situations where performance is of utmost importance, the C compiler will most likely not produce the fastest code compared to what you can do with hand-tuned assembly language. I tend to take the path of least resistance - for small routines like this, I just write asm code and have a good idea how many cycles it will take to execute. You may be able to fiddle with the C code and get the compiler to generate good output, but you may end up wasting lots of time tuning the output that way. Compilers (especially from Microsoft) have come a long way in the last few years, but they are still not as smart as the compiler between your ears, because you're working on your specific situation and not just a general case. The compiler may not make use of certain instructions (e.g. LDM) that can speed this up, and it's unlikely to be smart enough to unroll the loop.

Here's a way to do it which incorporates the 3 ideas I mentioned in my comment: loop unrolling, cache prefetch and making use of the multiple-load (ldm) instruction. The instruction cycle count comes out to about 3 clocks per array element, but this doesn't take into account memory delays.

Theory of operation: ARM's CPU design executes most instructions in one clock cycle, but the instructions are executed in a pipeline. C compilers will try to eliminate the pipeline delays by interleaving other instructions in between. When presented with a tight loop like the original C code, the compiler will have a hard time hiding the delays because the value read from memory must be immediately compared. My code below alternates between 2 sets of 4 registers to significantly reduce the delays of the memory itself and the pipeline fetching the data. In general, when working with large data sets and your code doesn't make use of most or all of the available registers, then you're not getting maximum performance.

; r0 = count, r1 = source ptr, r2 = comparison value

   stmfd sp!,{r4-r11}   ; save non-volatile registers
   mov r3,r0,LSR #3     ; loop count = total count / 8
   pld [r1,#128]
   ldmia r1!,{r4-r7}    ; pre load first set
loop_top:
   pld [r1,#128]
   ldmia r1!,{r8-r11}   ; pre load second set
   cmp r4,r2            ; search for match
   cmpne r5,r2          ; use conditional execution to avoid extra branch instructions
   cmpne r6,r2
   cmpne r7,r2
   beq found_it
   ldmia r1!,{r4-r7}    ; use 2 sets of registers to hide load delays (note: this preload reads up to 16 bytes past the end of the array on the final iteration)
   cmp r8,r2
   cmpne r9,r2
   cmpne r10,r2
   cmpne r11,r2
   beq found_it
   subs r3,r3,#1        ; decrement loop count
   bne loop_top
   mov r0,#0            ; return value = false (not found)
   ldmia sp!,{r4-r11}   ; restore non-volatile registers
   bx lr                ; return
found_it:
   mov r0,#1            ; return true
   ldmia sp!,{r4-r11}
   bx lr

Update: There are a lot of skeptics in the comments who think that my experience is anecdotal/worthless and require proof. I used GCC 4.8 (from the Android NDK 9C) to generate the following output with optimization -O2 (all optimizations turned on including loop unrolling). I compiled the original C code presented in the question above. Here's what GCC produced:

.L9: cmp r3, r0
     beq .L8
.L3: ldr r2, [r3, #4]!
     cmp r2, r1
     bne .L9
     mov r0, #1
.L2: add sp, sp, #1024
     bx  lr
.L8: mov r0, #0
     b .L2

GCC's output not only doesn't unroll the loop, but also wastes a clock on a stall after the LDR. It requires at least 8 clocks per array element. It does a good job of using the address to know when to exit the loop, but all of the magical things compilers are capable of doing are nowhere to be found in this code. I haven't run the code on the target platform (I don't own one), but anyone experienced in ARM code performance can see that my code is faster.
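
For comparison, a manual 8-way unroll in C is the closest portable approximation of the same idea. This is a hypothetical sketch (not part of my measured code), and GCC is not guaranteed to turn the grouped reads into LDM instructions:

#include <stdbool.h>
#include <stdint.h>

/* Sketch: 8-way manually unrolled scan; assumes count is a multiple of 8. */
bool find_unrolled(const uint32_t *p, uint32_t count, uint32_t key)
{
    for (uint32_t n = count / 8; n != 0; n--, p += 8)
    {
        /* Grouping the comparisons gives the compiler a chance to
           schedule all eight loads before the branches. */
        if (p[0] == key || p[1] == key || p[2] == key || p[3] == key ||
            p[4] == key || p[5] == key || p[6] == key || p[7] == key)
            return true;
    }
    return false;
}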

Update 2: I gave Microsoft's Visual Studio 2013 SP2 a chance to do better with the code. It was able to use NEON instructions to vectorize my array initialization, but the linear value search as written by the OP came out similar to what GCC generated (I renamed the labels to make it more readable):

loop_top:
   ldr  r3,[r1],#4  
   cmp  r3,r2  
   beq  true_exit
   subs r0,r0,#1 
   bne  loop_top
false_exit: xxx
   bx   lr
true_exit: xxx
   bx   lr

As I said, I don't own the OP's exact hardware, but I will be testing the performance of the 3 different versions on an nVidia Tegra 3 and Tegra 4, and will post the results here soon.

Update 3: I ran my code and Microsoft's compiled ARM code on a Tegra 3 and Tegra 4 (Surface RT, Surface RT 2). I ran 1000000 iterations of a loop which fails to find a match so that everything is in cache and it's easy to measure.

             My Code       MS Code
Surface RT    297ns         562ns
Surface RT 2  172ns         296ns  

In both cases my code runs almost twice as fast. Most modern ARM CPUs will probably give similar results.

BitBank
  • Bonus silly micro-optimisation: forget `r3`, use `subs r0, r0, #8` for the decrement instead, then `r0` will already be zero when you fall out of the 'not found' path. – Notlikethat Sep 04 '14 at 10:33
  • True - I'm used to needing to re-use the original count later in the function, so by default I send it to another register. – BitBank Sep 04 '14 at 10:35
  • in simple blocks of code you may beat the compiler, but in more complex situations you can hardly outperform it – phuclv Sep 04 '14 at 11:24
  • @LưuVĩnhPhúc - that's generally true, but tight ISRs are one of the biggest exceptions, in that you often know a lot more than the compiler does. – sapi Sep 04 '14 at 11:26
  • Isn't it better if you drop the `beq found_it`s, do two loads at the top and always do cmpne on all regardless? I don't know the capabilities of the given Cortex-M4, but two branches still seems too much to have in the code. – auselen Sep 04 '14 at 11:39
  • @auselen I wanted to give maximum time for the registers to get loaded since that's usually where the bottleneck is. If working with 0 wait state static RAM, then your idea would be better. – BitBank Sep 04 '14 at 11:42
  • If I remember correctly, `xor rx, rx` is faster than `mov rx, #0` – RevanProdigalKnight Sep 04 '14 at 15:16
  • This is a nice answer for generic ARM. See: [LPC4357 Datasheet](http://www.nxp.com/documents/data_sheet/LPC4357_53_37_33.pdf). *The ARM Cortex-M4 supports single-cycle digital signal processing and SIMD instructions.* The SIMD will probably allow comparison of multiple values at a time. – artless noise Sep 04 '14 at 15:26
  • @artlessnoise - AFAIK the Cortex-M4 doesn't support NEON, but instead has some instructions which can do 8/16-bit SIMD inside 32-bit registers (which wouldn't help in this situation) – BitBank Sep 04 '14 at 15:29
  • Well, sort of true. Two 16bit compares are the same as one 32bit compare. However, you are correct, it is not NEON. It doesn't look like the M4-SIMD supports an SIMD *vector compare* like the NEON does. As well, you can remove the `pld` if you cache lock the memory. The M4 has I/D-TCM and I think it is fast as cache. Positioning the array will also have a speed up; but it is not clear about who is creating the array. – artless noise Sep 04 '14 at 16:59
  • Could one remove the `beq found_it` instructions from the loop if one replaced `cmp r8,r2` with `cmpne r8,r2`, and `subs r3,r3,#1` with `subsne r3,r3,#1`? The loop will exit with r3 non-zero if the value was found, or zero if it wasn't. – supercat Sep 04 '14 at 19:27
  • Devil's advocate: is there any quantitative evidence that this code is faster? – Oliver Charlesworth Sep 04 '14 at 20:44
  • @OliCharlesworth - I haven't run the code; I wrote it off the top of my head, but I can tell you from experience that due to the pipelined nature of ARM processors, working with one word at a time versus the way I've written the code will get very different performance. If memory is very fast, then this code can be as much as 6 times faster than the original C code. – BitBank Sep 04 '14 at 21:14
  • @BitBank You wrote "I haven't run the code". If so, you can't make assertions about its performance. Even extensive personal experience can only generate guesses. – msw Sep 06 '14 at 05:42
  • @msw - you're right that they're guesses, but I've done this long enough to know that a C compiler is not going to generate what I wrote and mine will be much faster. – BitBank Sep 06 '14 at 08:18
  • @BitBank: That's not good enough. You have to back up your claims with _evidence_. – Lightness Races in Orbit Sep 06 '14 at 14:42
  • I learned my lesson years ago. I crafted an amazing optimised inner loop for a graphics routine on a Pentium, using the U and V pipes optimally. Got it down to 6 clock cycles per loop (calculated and measured), and I was very proud of myself. When I tested it against the same thing written in C, the C was faster. I never wrote another line of Intel assembler again. – Rocketmagnet Sep 06 '14 at 22:23
  • So the only thing you've done is unrolled the loop and combined the branches. No idea about ARM compilers, but those are some of *the* most basic optimizations any serious optimizing compiler would do. Rather shocking if gcc for ARM couldn't manage that itself. – Voo Sep 07 '14 at 14:29
  • Great answer, one thing that would make it better is actually benchmarking the two and showing us that your solution is faster (Compared to Barak Manos's solution in particular) - while it likely is, this should shut up the skeptics. – Benjamin Gruenbaum Sep 08 '14 at 07:48
  • *"skeptics in the comments who think that my experience is anecdotal/worthless and require proof."* Don't take their comments overly negatively. Showing the proof just makes your great answer all that much better. – Cody Gray Sep 08 '14 at 08:30
  • @Voo I've only played with it on one or two occasions, but it seems that GCC's ARM compiler frequently misses or fails to perform some fairly obvious optimizations. I assume this is the result of less programmer-years of development than the x86 compiler. – Cody Gray Sep 08 '14 at 08:32
  • Not sure about ARM, but many modern CPUs can predict fixed-step memory access patterns and will start prefetching after the first few iterations without explicit prefetch instruction. They also do fancy branch prediction and speculative execution, so an "if" statement can be much cheaper than you would think depending on the data. So in general it's very hard to predict performance on desktop PCs (maybe ARM is different). – maxy Sep 08 '14 at 18:12
  • Some proof is [here as gcc bug 48789](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48789); gcc doesn't understand the `ldm` nor `stm` instructions. These allow loading of multiple memory values to registers and are commonly coded in `memset()`, [`memcpy()`](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/lib/memcpy.S), etc as they minimize code overhead for data bound ARM code. You can use inline assembler with 'C' to get [the same effect](http://stackoverflow.com/questions/11640062/how-to-do-memory-test-on-arm-architecture-hardware-something-like-memtest86/). – artless noise Sep 09 '14 at 23:13
  • The OP ended up sorting the values beforehand and doing a binary search ... duh. – Jim Balter Oct 01 '14 at 07:00
  • @Puppy You are a terrible hater. – The Paramagnetic Croissant Aug 22 '15 at 01:45
  • Does NEON have a packed-compare instruction? Since the array is small, you could always loop over the whole thing, and use 128-bit vector OR instructions to combine the vector compare results into a single vector that has a non-zero element if there was a match. (Then horizontal OR this down to a truth value in a scalar register). Unrolling this with multiple accumulators can hide latencies for in-order cores like this one. – Peter Cordes Nov 14 '16 at 19:21
  • @PeterCordes - Yes, you can certainly take this approach with both NEON and SSE. I've written code to find the max/min value in a large unsorted list using SIMD and it definitely speeds it up quite a bit. Vectorizing compilers usually give up if there is a conditional statement in a loop, but it's easy to write intrinisics to handle the test case. – BitBank Nov 14 '16 at 19:40
  • IIRC, ternary operators can help with auto-vectorization, since compilers like the fact that the assignment is unconditional, and only the value is conditional. That's definitely true for convincing compilers to use CMOV instead of a branch. What really defeats gcc is loops with a trip-count that's not predictable (at run time) before entering the loop. So compilers suck at search loops usually, but *always* looping over the whole array should be fine. (Although I suspect you would need intrinsics for what I suggested, and it's probably easier to do that than wrestle with the compiler) – Peter Cordes Nov 14 '16 at 19:58
  • Someone has to tell the truth (instead of repeating conventional wisdom). – Peter Mortensen Aug 18 '18 at 13:39
  • `gcc -O2` doesn't include `-funroll-loops` in gcc4.8. That's only enabled by `-fprofile-use`, or if you specify it manually. `-O3` used to include `-funroll-loops`, and I don't know when it changed, but adding that option does change code-gen for the OP's C loop with gcc4.6 or gcc8.2 on Godbolt https://godbolt.org/z/d6yDF7. (Still nowhere near as good as your hand-written asm, though; it just replicates multiple ldr/cmp/beq blocks). GCC / clang suck at loops where the trip-count isn't calculable before entering the loop, e.g. they can't auto-vectorize search loops like this. – Peter Cordes Jan 21 '19 at 15:46

There's a trick for optimizing it (I was asked this on a job-interview once):

  • If the last entry in the array holds the value that you're looking for, then return true
  • Write the value that you're looking for into the last entry in the array
  • Iterate the array until you encounter the value that you're looking for
  • If you've encountered it before the last entry in the array, then return true
  • Return false

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    uint32_t x = theArray[SIZE-1];
    if (x == compareVal)
        return true;
    theArray[SIZE-1] = compareVal;
    for (i = 0; theArray[i] != compareVal; i++);
    theArray[SIZE-1] = x;
    return i != SIZE-1;
}

This yields one branch per iteration instead of two branches per iteration.


UPDATE:

If you're allowed to allocate the array with SIZE+1 entries, then you can get rid of the "last entry swapping" part:

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    theArray[SIZE] = compareVal;
    for (i = 0; theArray[i] != compareVal; i++);
    return i != SIZE;
}

You can also get rid of the additional arithmetic embedded in theArray[i], using the following instead:

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t *arrayPtr;
    theArray[SIZE] = compareVal;
    for (arrayPtr = theArray; *arrayPtr != compareVal; arrayPtr++);
    return arrayPtr != theArray+SIZE;
}

If the compiler doesn't already apply this transformation, then this version forces it. On the other hand, it might make it harder for the optimizer to unroll the loop, so you will have to verify that in the generated assembly code...

barak manos
  • @auselen: Thanks... well if it's a read-only memory then this solution is not feasible. – barak manos Sep 04 '14 at 11:50
  • @auselen: Then you copy it to RAM. You'd want that anyway because RAM is faster. – MSalters Sep 04 '14 at 11:50
  • @MSalters copy from flash + compare on ram > compare on flash ? If you do it only once I guess that won't be useful. – auselen Sep 04 '14 at 11:53
  • if you can copy into ram then allocate one extra space and put compareVal in there, no need for the first branch then – ratchet freak Sep 04 '14 at 12:14
  • @themik81: I rejected it; your post-edit was wrong (it changed the code to work incorrectly). – barak manos Sep 04 '14 at 12:14
  • @ratchetfreak: OP does not provide any details on how, where and when this array is allocated and initialized, so I gave an answer that does not depend on that. – barak manos Sep 04 '14 at 12:16
  • Array is in RAM, writes are not allowed though. – wlamers Sep 04 '14 at 14:21
  • @wlamers: If both `&theArray` and `compareVal` are constant throughout the execution of the program, then you can place the constant value of `compareVal` immediately after the array (as part of the executable image, **not** during runtime), and then use only a small portion of the code that I've provided - the `for` loop, and the return value being `i < SIZE` instead of `i < SIZE-1`. – barak manos Sep 04 '14 at 15:23
  • nice, but the array is no longer `const`, which makes this not thread-safe. Seems like a high price to pay. – EOF Sep 04 '14 at 17:12
  • @Shadow503: Thanks :) – barak manos Sep 04 '14 at 18:09
  • @EOF: Where was `const` ever mentioned in the question? – barak manos Sep 04 '14 at 18:10
  • @barakmanos: If I pass an array and a value to you, and ask you whether the value is in the array, I don't usually assume you'll be modifying the array. The original question mentions neither `const` nor threads, but I think it's fair to mention this caveat. – EOF Sep 04 '14 at 19:29
  • @EOF: Well, after OP wrote in one of the comments "writes are not allowed", I added an update, which you can read at the second part of the question. In short, you can simply add an entry at the end of the array, which no other thread uses but the thread which searches for `compareVal`. In addition, if `theArray` is located at the same memory address throughout the execution of the program, and the value of `compareVal` is constant throughout the execution of the program, then you can further optimize it by setting `theArray[SIZE] = compareVal` (or even as part of the executable image) once. – barak manos Sep 04 '14 at 19:46
  • I'll clarify what I mean: Say two threads search for two different values in the same array simultaneously (or concurrently, whichever). Now both try to modify the last value (or the sentinel value you want to write past the end of the array [btw, you'd have to allocate a +1 sized array to even do that, otherwise undefined behaviour]) and **boom**, equivalent to a non null-terminated string in one of the threads. – EOF Sep 04 '14 at 19:52
  • @EOF: The allocation is mentioned as part of the answer (not that it is that critical to the question, as it's pretty obvious that you need to allocate a bigger array, or an element "sitting" right next to it). In any case, in the second part of the answer (which refers to the case of a read-only array), I have clearly mentioned that it is relevant for a **constant** `compareVal`. – barak manos Sep 04 '14 at 19:57
  • If both the array and the value you look for are constant, **why would you need to search for it in the first place?**. And you can't get '[...] an element "sitting" right next to it' in C. You get variables with some storage. This really has to be the equivalent to a string (hence the +1 size...), and contiguous allocation from the beginning. – EOF Sep 05 '14 at 02:54
  • @EOF: The **address** of the array is constant, not the contents of the array!!! Regarding the "element sitting next to it" - it's a common thing to do in embedded systems (which is what OP seems to be working on). You allocate a `uint32_t` right after the array. You make sure that it is "write after the array" through the linker command-file (assuming that they are both global variables, it is absolutely feasible, and in fact, a very common thing to do on these systems). In any case, you might as well declare the array to size `SIZE+1`. I might change it to this, just to make it simpler. – barak manos Sep 05 '14 at 06:07
  • @EOF: A couple of additional things related to your previous comments. OP mentions that this code runs in the context of an ISR, so it doesn't need to be thread-safe, and I doubt that it even needs to be ISR-safe. In any case, even if the code does need to be thread-safe, and assuming that we have a small number of `N` threads, we can simply allocate `uint32_t theArray[SIZE+N]`, and let each thread write `theArray[SIZE+threadId] = compareVal`. This would make the whole thing thread-safe while keeping the complexity at `O(n)`, since `N` is much smaller than `SIZE`. – barak manos Sep 05 '14 at 06:11
  • Quite some discussion going on here. If I had known that all this would be important I would have given some more details in the question. My apologies. The above suggestion could work in my application. To clarify some more: the array is in RAM because it changes once in a while, but never in the ISR itself. I could add an extra item which may be changed in the ISR; this is feasible. Thread safety is (luckily) no issue. Both the array and pointer+compareVal can be accessed simultaneously by the CPU since they are in different RAMs which can be accessed by two separate busses. – wlamers Sep 05 '14 at 07:42
  • This already speeds things up. In the meantime (see comment in the first post) I solved the issue using a binary search. This speeds things up from 12.4 to 3.9 µs in the worst case. Quite some improvement. This could probably be even faster, but the time/effort required to do that does not weigh up to the gain. I am very happy with the result. Thanks all for your input! – wlamers Sep 05 '14 at 07:44
  • @wlamers: You're welcome. If you're allowed to allocate the array one entry larger, then it might work out for you without changing the value of that last entry (as explained in the second part of the answer). Of course, if you're allowed to sort the array beforehand (or if you have it sorted already), then it's a totally different thing, and I guess that a binary-search beats all other methods. In any case, thank you for the interesting question, and for the (currently) second highest score I got for an answer here :) – barak manos Sep 05 '14 at 07:52
  • @Christian: Thanks :) – barak manos Oct 04 '14 at 04:43
  • "the array is in RAM because it changes once in a while, but never in the ISR itself." It's important to bear in mind that for a binary search solution to work, changing the array's contents requires maintaining its sort order. If the array is modified in-place, this requires inhibiting the ISR while the sort is being performed, at the cost of some interrupt responsiveness. Another option might be to prepare the updated table elsewhere and swap it in when it's ready, at the cost of memory. Other options also suggest themselves. Presumably your solution takes this consideration into account. – Jeremy Aug 02 '17 at 15:13

You're asking for help with optimising your algorithm, which may push you to assembler. But your algorithm (a linear search) is not so clever, so you should consider changing your algorithm. E.g.:

Perfect hash function

If your 256 "valid" values are static and known at compile time, then you can use a perfect hash function. You need to find a hash function that maps your input value to a value in the range 0..n-1, where there are no collisions for all the valid values you care about. That is, no two "valid" values hash to the same output value. When searching for a good hash function, you aim to:

  • Keep the hash function reasonably fast.
  • Minimise n. The smallest you can get is 256 (minimal perfect hash function), but that's probably hard to achieve, depending on the data.

Note that for efficient hash functions, n is often a power of 2, so that the modulo reduces to a bitwise mask of the low bits (an AND operation). Example hash functions:

  • CRC of input bytes, modulo n.
  • ((x << i) ^ (x >> j) ^ (x << k) ^ ...) % n (picking as many i, j, k, ... as needed, with left or right shifts)

Then you make a fixed table of n entries, where the hash maps the input values to an index i into the table. For valid values, table entry i contains the valid value. For all other table entries, ensure that each entry of index i contains some other invalid value which doesn't hash to i.

Then in your interrupt routine, with input x:

  1. Hash x to index i (which is in the range 0..n-1)
  2. Look up entry i in the table and see if it contains the value x.

This will be much faster than a linear search of 256 or 1024 values.
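
To make steps 1 and 2 concrete, the lookup can be as small as the following hypothetical sketch. The table contents and the exact hash are placeholders that would have to be generated offline (e.g. by the Python search mentioned below) for your actual values:

#include <stdbool.h>
#include <stdint.h>

#define N 1024  /* table size; a power of 2, so the hash ends in a mask */

/* Generated offline: valid values at their hash index, non-colliding
   invalid fillers everywhere else. */
static const uint32_t table[N] = { 0 /* ... */ };

static inline bool is_valid(uint32_t x)
{
    /* Example shift-XOR hash; it must be collision-free for your values. */
    uint32_t i = ((x >> 16) ^ x) & (N - 1);
    return table[i] == x;
}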

I've written some Python code to find reasonable hash functions.

Binary search

If you sort your array of 256 "valid" values, then you can do a binary search, rather than a linear search. That means you should be able to search a 256-entry table in only 8 steps (log2(256)), or a 1024-entry table in 10 steps. Again, this will be much faster than a linear search of 256 or 1024 values.
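
For reference, a minimal iterative version of that lookup might look like this (a sketch, assuming the table is sorted in ascending order):

#include <stdbool.h>
#include <stdint.h>

bool binary_contains(const uint32_t *sortedArray, uint32_t n, uint32_t key)
{
    uint32_t lo = 0, hi = n;            /* half-open search range [lo, hi) */
    while (lo < hi)
    {
        uint32_t mid = lo + (hi - lo) / 2;
        if (sortedArray[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < n && sortedArray[lo] == key;
}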

Craig McQueen
  • Thanks for that. The binary search option is the one I have chosen. See also an earlier comment in the first post. This does the trick very well without using assembly. – wlamers Sep 05 '14 at 07:37
  • Indeed, before trying to optimize your code (such as using assembly or other tricks) you should probably see if you can reduce the algorithmic complexity. Usually reducing the algorithmic complexity will be more efficient than trying to scrape a few cycles while keeping the same algorithmic complexity. – ysdx Sep 06 '14 at 07:21
  • +1 for binary search. Algorithmic re-design is the best way to optimise. – Rocketmagnet Sep 06 '14 at 22:16
  • A popular notion is that it takes too much effort to find an efficient hash routine so the "best practice" is a binary search. Sometimes though, "best practice" is not good enough. Suppose you are routing network traffic on the fly at the moment when a packet's header has arrived (but not its payload): using a binary search would make your product hopelessly slow. Embedded products usually have such constraints and requirements that what is "best practice" in, for example, an x86 execution environment is "taking the easy way out" in embedded. – Olof Forshell Jun 23 '15 at 12:22

Keep the table in sorted order, and use Bentley's unrolled binary search:

i = 0;
if (key >= a[i+512]) i += 512;
if (key >= a[i+256]) i += 256;
if (key >= a[i+128]) i += 128;
if (key >= a[i+ 64]) i +=  64;
if (key >= a[i+ 32]) i +=  32;
if (key >= a[i+ 16]) i +=  16;
if (key >= a[i+  8]) i +=   8;
if (key >= a[i+  4]) i +=   4;
if (key >= a[i+  2]) i +=   2;
if (key >= a[i+  1]) i +=   1;
return (key == a[i]);

The point is,

  • if you know how big the table is, then you know how many iterations there will be, so you can fully unroll it.
  • Then, there's no point testing for the == case on each iteration because, except on the last iteration, the probability of that case is too low to justify spending time testing for it.**
  • Finally, by expanding the table to a power of 2, you add at most one comparison, and at most a factor of two storage.

** If you're not used to thinking in terms of probabilities, every decision point has an entropy, which is the average information you learn by executing it. For the >= tests, the probability of each branch is about 0.5, and -log2(0.5) is 1, so that means if you take one branch you learn 1 bit, and if you take the other branch you learn one bit, and the average is just the sum of what you learn on each branch times the probability of that branch. So 1*0.5 + 1*0.5 = 1, so the entropy of the >= test is 1. Since you have 10 bits to learn, it takes 10 branches. That's why it's fast!

On the other hand, what if your first test is if (key == a[i+512])? The probability of being true is 1/1024, while the probability of false is 1023/1024. So if it's true you learn all 10 bits! But if it's false you learn -log2(1023/1024) = .00141 bits, practically nothing! So the average amount you learn from that test is 10/1024 + .00141*1023/1024 = .0098 + .00141 = .0112 bits. About one hundredth of a bit. That test is not carrying its weight!

Mike Dunlavey
  • I really like this solution. It can be modified to run in a fixed number of cycles to avoid timing-based forensics if the location of the value is sensitive information. – OregonTrail Sep 05 '14 at 16:17
  • @OregonTrail: Timing-based forensics? Fun problem, but sad comment. – Mike Dunlavey Sep 05 '14 at 17:00
  • You see unrolled loops like this in crypto libraries to prevent Timing Attacks https://en.wikipedia.org/wiki/Timing_attack. Here's a good example https://github.com/jedisct1/libsodium/blob/e06ae6db9d843dd9614d34bc1a55977a6a403c3f/src/libsodium/crypto_verify/32/ref/verify_32.c In this case we are preventing an attacker from guessing the length of a string. Usually the attacker will take several million samples of a function invocation to perform a timing attack. – OregonTrail Sep 05 '14 at 17:19
  • +1 Great! Nice little unrolled search. I hadn't seen that before. I might use it. – Rocketmagnet Sep 06 '14 at 22:16
  • Elegant and fast! It couldn't be better! – Christian Oct 03 '14 at 17:50
  • @OregonTrail: I second your timing-based comment. I have more than once had to write cryptographic code that executes in a fixed number of cycles, to avoid leaking information to timing-based attacks. – TonyK Nov 25 '14 at 12:41

If the set of constants in your table is known in advance, you can use perfect hashing to ensure that only one access is made to the table. Perfect hashing determines a hash function that maps every interesting key to a unique slot (that table isn't always dense, but you can decide how un-dense a table you can afford, with less dense tables typically leading to simpler hashing functions).

Usually, the perfect hash function for the specific set of keys is relatively easy to compute; you don't want that to be long and complicated because that competes for time perhaps better spent doing multiple probes.

Perfect hashing is a "1-probe max" scheme. One can generalize the idea, with the thought that one should trade simplicity of computing the hash code with the time it takes to make k probes. After all, the goal is "least total time to look up", not fewest probes or simplest hash function. However, I've never seen anybody build a k-probes-max hashing algorithm. I suspect one can do it, but that's likely research.

One other thought: if your processor is extremely fast, the one probe to memory from a perfect hash probably dominates the execution time. If the processor is not very fast, then k>1 probes might be practical.
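
For what it's worth, one published scheme does fit the "k-probes-max" description: cuckoo hashing guarantees at most two probes per lookup. A hypothetical sketch of the lookup side (the table size and multiplicative hash constants are illustrative; the offline insertion procedure is what makes the guarantee hold):

#include <stdbool.h>
#include <stdint.h>

#define SLOTS 512  /* 9-bit index, taken from the top bits of the product */

/* Two tables built offline; assumes 0 is not a valid key (empty slots hold 0). */
static uint32_t t1[SLOTS], t2[SLOTS];

static inline bool contains2(uint32_t x)
{
    return t1[(x * 0x9E3779B9u) >> 23] == x ||  /* probe 1 */
           t2[(x * 0x85EBCA6Bu) >> 23] == x;    /* probe 2 */
}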

Ira Baxter
  • A Cortex-M is nowhere near _extremely fast_. – MSalters Sep 04 '14 at 22:59
  • In fact in this case he doesn't need any hash table at all. He only wants to know if a certain key is in the set, he doesn't want to map it to a value. So it's enough if the perfect hash function maps each 32 bit value to either 0 or 1 where "1" could be defined as "is in the set". – David Ongaro Sep 05 '14 at 00:24
  • Good point, if he can get a perfect hash generator to produce such a mapping. But, that would be "an extremely dense set"; I doubt he can find a perfect hash generator that does that. He might be better off trying to get a perfect hash that produces some constant K if in the set, and any value but K if not in the set. I suspect it is hard to get a perfect hash even for the latter. – Ira Baxter Sep 05 '14 at 00:31
  • @DavidOngaro `table[PerfectHash(value)] == value` yields 1 if the value is in the set and 0 if it isn't, and there are well known ways to produce the PerfectHash function (see, e.g., http://burtleburtle.net/bob/hash/perfect.html). Trying to find a hash function that directly maps all values in the set into 1 and all values not in the set to 0 is a foolhardy task. – Jim Balter Oct 01 '14 at 08:28
  • @DavidOngaro: a perfect hash function has many "false positives", which is to say, values *not* in the set would have the same hash as values in the set. So you have to have a table, indexed by the hash value, containing the "in-the-set" input value. So to validate any given input value you (a) hash it; (b) use the hash value to do the table look-up; (c) check if the entry in the table matches the input value. – Craig McQueen Nov 14 '16 at 23:07
  • @CraigMcQueen: a simple counter example would be `value & 1` which would be a perfect hash function for the set of even (or uneven) integers (given that `value` as type `integer` is fixed). But it simply depends on your usecase, e.g. if you want to have a perfect hash to use as a jumptable for a list of keywords and you can not be sure that your input string is in the list of your keywords, or at least is very constrained, then yes you need to store the value as well. – David Ongaro Nov 15 '16 at 00:03
  • @CraigMcQueen: Mathematically you can always have a "perfect hash" function of a set, by defining it as the list of all possible values and a function which map all input values to 1 or 0 depending wether or not it is in the the list. In this way the "perfect hash" would be just a representation of the set. Practically it only makes sense to talk about a "perfect hash function of a set" when the complexity stays near O(1) and the space requirement somewhat below O(n). – David Ongaro Nov 15 '16 at 00:14

Use a hash set. It will give O(1) lookup time.

The following code assumes that you can reserve value 0 as an 'empty' value, i.e. not occurring in actual data. The solution can be expanded for a situation where this is not the case.

#define HASH(x) (((x >> 16) ^ x) & 1023)
#define HASH_LEN 1024
uint32_t my_hash[HASH_LEN];

int lookup(uint32_t value)
{
    int i = HASH(value);
    /* Linear probing with wrap-around; assumes the table is never
       completely full, otherwise this loop would not terminate. */
    while (my_hash[i] != 0 && my_hash[i] != value) i = (i + 1) % HASH_LEN;
    return i;
}

void store(uint32_t value)
{
    int i = lookup(value);
    if (my_hash[i] == 0)
       my_hash[i] = value;
}

bool contains(uint32_t value)
{
    return (my_hash[lookup(value)] == value);
}

In this example implementation, the lookup time will typically be very low, but in the worst case it can be up to the number of entries stored. For a realtime application, you can also consider an implementation using binary trees, which will have a more predictable lookup time.

jpa
  • It depends on how many times this lookup has to be done for this to be effective. – maxywb Sep 04 '14 at 16:05
  • Er, lookup can run off the end of the array. And this sort of linear hashing has high collision rates -- no way you'll get O(1). Good hash sets aren't implemented like this. – Jim Balter Oct 01 '14 at 07:35
  • @JimBalter True, not perfect code. More like the general idea; could have just pointed to existing hash set code. But considering that this is an interrupt service routine it may be useful to demonstrate that the lookup is not very complex code. – jpa Oct 01 '14 at 08:34
  • You should just fix it so it wraps i around. – Jim Balter Oct 01 '14 at 09:00
  • The point of a perfect hash function is that it does one probe. Period. – Ira Baxter Nov 14 '16 at 23:20
  • Why is `i` signed `int`? Probably the compiler can prove that it stays non-negative (and thus `% HASH_LEN` can be implemented as `& (HASH_LEN - 1)`), but you might lead the compiler to emit code that accounts for signed remainder semantics. – Peter Cordes Jan 21 '19 at 15:29

In this case, it might be worthwhile investigating Bloom filters. They're capable of quickly establishing that a value is not present, which is a good thing since most of the 2^32 possible values are not in that 1024 element array. However, there are some false positives that will need an extra check.

Since your table is apparently static, you can determine which false positives exist for your Bloom filter and put those in a perfect hash.
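
To make the idea concrete, here is a hypothetical sketch of the membership test (the filter size and the two hash functions are placeholders; the bit array is precomputed offline from the valid values):

#include <stdbool.h>
#include <stdint.h>

#define BLOOM_BITS 2048

/* One bit per hash; both bits are set for every valid value (offline). */
static uint32_t bloom[BLOOM_BITS / 32];

static inline bool bloom_maybe_contains(uint32_t x)
{
    uint32_t h1 = x % BLOOM_BITS;          /* illustrative hash 1 */
    uint32_t h2 = (x >> 11) % BLOOM_BITS;  /* illustrative hash 2 */
    return ((bloom[h1 / 32] >> (h1 % 32)) & 1) &&
           ((bloom[h2 / 32] >> (h2 % 32)) & 1);
}
/* A false result is definitive; a true result still needs the exact check. */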

MSalters

Assuming your processor runs at 204 MHz, which seems to be the maximum for the LPC4357, and also assuming your timing result reflects the average case (half of the array traversed), we get:

  • CPU frequency: 204 MHz
  • Cycle period: 4.9 ns
  • Duration in cycles: 12.5 µs / 4.9 ns = 2551 cycles
  • Cycles per iteration: 2551 / 128 = 19.9

So, your search loop spends around 20 cycles per iteration. That doesn't sound awful, but I guess that in order to make it faster you need to look at the assembly.

I would recommend dropping the index and using a pointer comparison instead, and making all the pointers const.

bool arrayContains(const uint32_t *array, size_t length)
{
  const uint32_t * const end = array + length;
  while(array != end)
  {
    if(*array++ == 0x1234ABCD)
      return true;
  }
  return false;
}

That's at least worth testing.

unwind
  • -1, ARM has an indexed address mode so this is pointless. As for making the pointer `const`, GCC already spots that it doesn't change. The `const` doesn't add anything either. – MSalters Sep 04 '14 at 11:49
  • @MSalters OK, I didn't verify with the generated code, the point was to express something that makes it simpler at the C level, and I think just managing pointers instead of a pointer and an index *is* simpler. I simply disagree that "`const` doesn't add anything": it very clearly tells the reader that the value won't change. That is fantastic information. – unwind Sep 04 '14 at 12:09
  • This is deeply embedded code; optimizations so far have included moving the code from flash to RAM. And yet it still needs to be faster. At this point, readability is _not_ the goal. – MSalters Sep 04 '14 at 22:43
  • @MSalters "ARM has an indexed address mode so this is pointless" -- well, if you completely miss the point ... the OP wrote "I also use pointer arithmetic and a for loop". unwind didn't replace indexing with pointers, he just eliminated the index variable and thus an extra subtract on every loop iteration. But the OP was wise (unlike many of the people answering and commenting) and ended up doing a binary search. – Jim Balter Oct 01 '14 at 20:38

Other people have suggested reorganizing your table, adding a sentinel value at the end, or sorting it in order to provide a binary search.

You state "I also use pointer arithmetic and a for loop, which does down-counting instead of up (checking if i != 0 is faster than checking if i < 256)."

My first advice is: get rid of the pointer arithmetic and the downcounting. Stuff like

for (i=0; i<256; i++)
{
    if (compareVal == the_array[i])
    {
       [...]
    }
}

tends to be idiomatic to the compiler. The loop is idiomatic, and the indexing of an array over a loop variable is idiomatic. Juggling with pointer arithmetic and pointers will tend to obfuscate the idioms to the compiler and make it generate code related to what you wrote rather than what the compiler writer decided to be the best course for the general task.

For example, the above code might be compiled into a loop running from -256 or -255 to zero, indexing off &the_array[256]. Possibly stuff that is not even expressible in valid C but matches the architecture of the machine you are generating for.

So don't microoptimize. You are just throwing spanners into the works of your optimizer. If you want to be clever, work on the data structures and algorithms but don't microoptimize their expression. It will just come back to bite you, if not on the current compiler/architecture, then on the next.

In particular using pointer arithmetic instead of arrays and indexes is poison for the compiler being fully aware of alignments, storage locations, aliasing considerations and other stuff, and for doing optimizations like strength reduction in the way best suited to the machine architecture.

Grijesh Chauhan
  • Loops over pointers are idiomatic in C and good optimizing compilers can handle them just as well as indexing. But this whole thing is moot because the OP ended up doing a binary search. – Jim Balter Oct 01 '14 at 07:26

If you can accommodate the domain of your values with the amount of memory that's available to your application, then the fastest solution would be to represent your array as an array of bits:

bool theArray[MAX_VALUE]; // of which 1024 values are true, the rest false
uint32_t compareVal = 0x1234ABCD;
bool validFlag = theArray[compareVal];

EDIT

I'm astounded by the number of critics. The title of this thread is "How do I quickly find whether a value is present in a C array?", and I will stand by my answer because it answers precisely that. I could argue that this has the most speed-efficient hash function (since address == value). I've read the comments and I'm aware of the obvious caveats. Undoubtedly those caveats limit the range of problems this can be used to solve, but, for those problems that it does solve, it solves them very efficiently.

Rather than reject this answer outright, consider it as the optimal starting point from which you can evolve by using hash functions to achieve a better balance between speed and memory.
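
If the plain bool array is too large, the same idea packed into an actual bitmap costs one extra shift and mask per lookup but only MAX_VALUE/8 bytes of storage. A sketch along the same lines (MAX_VALUE as above, assumed to be a multiple of 32; the function name is illustrative):

#include <stdbool.h>
#include <stdint.h>

/* Bit v is set iff v is one of the valid values (filled in beforehand). */
static uint32_t bitmap[MAX_VALUE / 32];

static inline bool is_present(uint32_t v)
{
    return (bitmap[v / 32] >> (v % 32)) & 1;
}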

Stephen Quan
  • How does this get 4 upvotes? The question states it's a Cortex M4. The thing has 136 KB RAM, not 262.144 KB. – MSalters Sep 05 '14 at 22:40
  • It's astounding how many upvotes were given to manifestly wrong answers because the answerer missed the forest for the trees. For the OP's largest case O(log n) << O(n). – msw Sep 06 '14 at 05:59
  • I get very grumpy at programmers who burn ridiculous amounts of memory, when there are far better solutions available. Every 5 years it seems that my PC is running out of memory, where 5 years ago that amount was plenty. – Craig McQueen Sep 08 '14 at 01:07
  • @CraigMcQueen Kids these days. Wasting memory. Outrageous! Back in my days, we had 1 MiB of memory and a word size of 16-bits. /s – Cole Johnson Sep 08 '14 at 06:09
  • What's with the harsh critics? The OP clearly states the speed is absolutely critical for this portion of code, and StephenQuan already mentioned a "ridiculous amount of memory". – Bogdan Alexandru Sep 08 '14 at 07:12

Vectorization can be used here, as it often is in implementations of memchr. You use the following algorithm:

  1. Create a mask of your query repeating, equal in length to your machine's word size (64-bit, 32-bit, etc.). On a 64-bit system you would repeat the 32-bit query twice.

  2. Process the list as a list of multiple pieces of data at once, simply by casting the list to a list of a larger data type and pulling values out. For each chunk, XOR it with the mask, then XOR with 0b0111...1, then add 1, then & with a mask of 0b1000...0 repeating. If the result is 0, there is definitely not a match. Otherwise, there may (usually with very high probability) be a match, so search the chunk normally.

Example implementation: https://sourceware.org/cgi-bin/cvsweb.cgi/src/newlib/libc/string/memchr.c?rev=1.3&content-type=text/x-cvsweb-markup&cvsroot=src
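
A hypothetical sketch of that idea for 32-bit elements on a 64-bit machine (names and constants are illustrative; memcpy is used for the chunk loads to stay clear of alignment and strict-aliasing problems):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: scan two 32-bit elements per 64-bit chunk; assumes n is even. */
bool contains_swar(const uint32_t *arr, size_t n, uint32_t key)
{
    const uint64_t mask = ((uint64_t)key << 32) | key;  /* query repeated */

    for (size_t i = 0; i < n; i += 2)
    {
        uint64_t chunk;
        memcpy(&chunk, &arr[i], sizeof chunk);  /* portable 64-bit load */
        uint64_t x = chunk ^ mask;              /* a matching lane becomes 0 */
        /* "Has a zero 32-bit lane" test; on a hit, confirm exactly. */
        if ((x - 0x0000000100000001ULL) & ~x & 0x8000000080000000ULL)
        {
            if (arr[i] == key || arr[i + 1] == key)
                return true;
        }
    }
    return false;
}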

meisel

I'm sorry if this was already answered - I'm a lazy reader. Feel free to downvote then ))

1) You could remove the counter 'i' entirely - just compare pointers, i.e.

const uint32_t *ptr;
for (ptr = &the_array[0]; ptr < the_array + 1024; ptr++)
{
    if (compareVal == *ptr)
    {
        break;
    }
}

Compare ptr against the_array + 1024 after the loop - you do not need validFlag at all.

All that won't give any significant improvement though; such an optimization can probably be achieved by the compiler itself.

2) As already mentioned in other answers, almost all modern CPUs are RISC-based, for example ARM. Even modern Intel x86 CPUs use RISC cores inside, as far as I know (translating from x86 on the fly). A major optimization for RISC (and for Intel and other CPUs as well) is pipeline optimization, minimizing code jumps. One type of such optimization (probably a major one) is loop unrolling ("cycle rollback"). It's incredibly simple and efficient; even the Intel compiler can do it, AFAIK. It looks like:

if (compareVal == the_array[0]) { validFlag = true; goto end_of_compare; }
if (compareVal == the_array[1]) { validFlag = true; goto end_of_compare; }
...and so on...
end_of_compare:

This way the pipeline is never broken in the worst case (when compareVal is absent from the array), so the scan is as fast as possible (of course not counting algorithmic optimizations such as hash tables, sorted arrays and so on, mentioned in other answers, which may give better results depending on array size; the unrolling approach can be applied to those as well, by the way - I'm only writing here about what I think I didn't see in the other answers).

The second part of this optimization is that each array item is accessed by a direct address (calculated at compile time; make sure you use a static array), so no additional ADD operation is needed to compute a pointer from the array's base address. This may not have a significant effect, since AFAIK the ARM architecture has special features to speed up array addressing. But anyway, it's always better to know that you did all the best just in C code directly, right?

Unrolling may look awkward due to the waste of ROM (yes, you were right to place the code in a fast part of RAM, if your board supports this feature), but actually it's a fair price for speed, being based on the RISC concept. This is just a general point of optimization - you sacrifice space for the sake of speed, and vice versa, depending on your requirements.

If you think that unrolling for an array of 1024 elements is too large a sacrifice in your case, you can consider 'partial unrolling', for example dividing the array into 2 parts of 512 items each, or 4x256, and so on.

3) Modern CPUs often support SIMD operations, for example the ARM NEON instruction set - it allows executing the same operation on multiple values in parallel. Frankly speaking, I do not remember whether it is suitable for comparison operations, but I feel it may be; you should check that. Googling shows that there may be some tricks as well to get maximum speed, see https://stackoverflow.com/a/5734019/1028256

I hope this can give you some new ideas.

Mixaz
  • The OP bypassed all the foolish answers focused on optimizing linear loops, and instead presorted the array and did binary search. – Jim Balter Oct 01 '14 at 07:19
  • @Jim, it is obvious that that kind of optimization should be made first. 'Foolish' answers may look not so foolish in some use cases when for example you do not have time to sort the array. Or if the speed you get, is not enough anyway – Mixaz Oct 09 '14 at 10:49
  • "it is obvious that that kind of optimization should be made first" -- obviously not to the people who went to great effort to develop linear solutions. "you do not have time to sort the array" -- I have no idea what that means. "Or if the speed you get, is not enough anyway" -- Uh, if the speed from a binary search is "not enough", doing an optimized linear search won't improve it. Now I'm done with this subject. – Jim Balter Oct 09 '14 at 19:34
  • @JimBalter, if I had such problem as OP, I certainly would consider using algs like binary search or something. I just couldn't think that OP didn't consider it already. "you do not have time to sort the array" means that sorting array takes time. If you need to do it for each input data set, it may take longer time than a linear loop. "Or if the speed you get, is not enough anyway" means following - optimization hints above could be used to speed up binary search code or whatsoever – Mixaz Oct 10 '14 at 20:38

This is more like an addendum than an answer.

I've had a similar case in the past, but my array was constant over a considerable number of searches.

In half of them, the searched value was NOT present in the array. Then I realized I could apply a "filter" before doing any search.

This "filter" is just a simple integer number, calculated ONCE and used in each search.

It's in Java, but it's pretty simple:

int binaryfilter = 0;
for (int i = 0; i < array.length; i++)
{
    // just apply "Binary OR Operator" over values.
    binaryfilter = binaryfilter | array[i];
}

So, before doing a binary search, I check binaryfilter:

// Check binaryfilter vs value with a "Binary AND Operator"
if ((binaryfilter & valuetosearch) != valuetosearch)
{
    // valuetosearch is not in the array!
    return false;
}
else
{
    // valuetosearch MAYBE in the array, so let's check it out
    // ... do binary search stuff ...

}

You can use a 'better' hash algorithm, but this can be very fast, especially for large numbers. Maybe this could save you even more cycles.

Christian

Make sure the instructions ("the pseudo code") and the data ("theArray") are in separate (RAM) memories so the CM4 Harvard architecture is utilized to its full potential. From the user manual:


To optimize the CPU performance, the ARM Cortex-M4 has three buses for Instruction (code) (I) access, Data (D) access, and System (S) access. When instructions and data are kept in separate memories, then code and data accesses can be done in parallel in one cycle. When code and data are kept in the same memory, then instructions that load or store data may take two cycles.

Following this guideline I observed ~30% speed increase (FFT calculation in my case).
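
With GCC this separation is typically arranged through section attributes plus matching entries in the linker script. A hypothetical sketch - the section names are examples and must match whatever your linker script actually defines:

#include <stdint.h>

/* Place the table in a RAM bank reached over the D (data) bus... */
uint32_t theArray[256] __attribute__((section(".data_RAM2")));

/* ...and keep the time-critical code in RAM reached over the I (code) bus. */
__attribute__((section(".ramfunc")))
void time_critical_isr(void)
{
    /* ... search theArray here ... */
}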

francek
  • Interesting, Cortex-M7 has optional instruction/data caches, but before that definitely not. https://en.wikipedia.org/wiki/ARM_Cortex-M#Silicon_customization. – Peter Cordes Jan 21 '19 at 15:23

I'm a great fan of hashing. The problem of course is to find an efficient algorithm that is both fast and uses a minimum amount of memory (especially on an embedded processor).

If you know beforehand the values that may occur you can create a program that runs through a multitude of algorithms to find the best one - or, rather, the best parameters for your data.

I created such a program that you can read about in this post and achieved some very fast results. 16000 entries translates roughly to 2^14 or an average of 14 comparisons to find the value using a binary search. I explicitly aimed for very fast lookups - on average finding the value in <=1.5 lookups - which resulted in greater RAM requirements. I believe that with a more conservative average value (say <=3) a lot of memory could be saved. By comparison the average case for a binary search on your 256 or 1024 entries would result in an average number of comparisons of 8 and 10, respectively.

My average lookup required around 60 cycles (on a laptop with an Intel i5) with a generic algorithm (utilizing one division by a variable) and 40-45 cycles with a specialized one (probably utilizing a multiplication). This should translate into sub-microsecond lookup times on your MCU, depending of course on the clock frequency it executes at.

It can be real-life-tweaked further if the entry array keeps track of how many times an entry was accessed. If the entry array is sorted from most to least accessed before the indices are computed, then it'll find the most commonly occurring values with a single comparison.

Olof Forshell