Simulating LDREX/STREX (load/store exclusive) in Cortex-M0

Question

In the Cortex-M3 instruction set, there exists a family of LDREX/STREX instructions such that if a location is read with an LDREX instruction, a following STREX instruction can write to that address only if the address is known to have been untouched. Typically, the effect is that the STREX will succeed if no interrupts ("exceptions" in ARM parlance) have occurred since the LDREX, but fail otherwise.

What's the most practical way to simulate such behavior on the Cortex-M0? I would like to write C code for the M3 and have it be portable to the M0. On the M3, one can say something like:

__inline void do_inc(unsigned int *dat)
{
  while(__strex(__ldrex(dat)+1,dat)) {}
}

to perform an atomic increment. The only ways I can think of to achieve similar functionality on the Cortex-M0 would be to either:

1. Have "ldrex" disable exceptions and have "strex" and "clrex" re-enable them, with the requirement that every "ldrex" must be followed soon thereafter by either a "strex" or "clrex".
2. Have "ldrex", "strex", and "clrex" be very small routines in RAM, with one instruction of "strex" being patched to either "str r1,[r2]" or "mov r0,#1". Have the "ldrex" routine plug a "str" instruction into the "strex" routine, and have the "clrex" routine plug "mov r0,#1" there. Have all exceptions that might invalidate a "ldrex" sequence call "clrex".

Depending upon how the ldrex/strex functions are used, disabling interrupts might work reasonably, but it seems icky to change the semantics of "load-exclusive" so as to cause bad side-effects if it's abandoned. The code-patching idea seems like it would achieve the desired semantics, but it seems clunky.
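As a rough illustration of the first option, here is a minimal sketch, assuming GCC-style inline assembly and the ARMv6-M PRIMASK register. The names `emu_ldrex`, `emu_strex`, `emu_clrex`, `irq_save` and `irq_restore` are hypothetical, and the non-ARM branch is only a stand-in so that the logic can be exercised on a host machine.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_6M__)
/* Cortex-M0: read PRIMASK, then mask interrupts. */
static inline uint32_t irq_save(void) {
  uint32_t primask;
  __asm__ __volatile__("mrs %0, PRIMASK\n\tcpsid i"
                       : "=r"(primask) : : "memory");
  return primask;
}
static inline void irq_restore(uint32_t primask) {
  __asm__ __volatile__("msr PRIMASK, %0" : : "r"(primask) : "memory");
}
#else
/* Host stand-in (assumption): a plain variable models PRIMASK. */
static uint32_t fake_primask;
static inline uint32_t irq_save(void) {
  uint32_t old = fake_primask;
  fake_primask = 1;
  return old;
}
static inline void irq_restore(uint32_t primask) { fake_primask = primask; }
#endif

static uint32_t emu_saved_primask; /* state captured by emu_ldrex */
static int emu_armed;              /* nonzero between ldrex and strex/clrex */

/* "Load exclusive": mask interrupts (once) and read the word. */
static inline uint32_t emu_ldrex(volatile uint32_t *addr) {
  if (!emu_armed) {
    emu_saved_primask = irq_save();
    emu_armed = 1;
  }
  return *addr;
}

/* "Store exclusive": returns 0 on success, 1 on failure, like STREX. */
static inline int emu_strex(uint32_t value, volatile uint32_t *addr) {
  if (!emu_armed)
    return 1;                 /* no pending ldrex, or clrex intervened */
  *addr = value;
  emu_armed = 0;
  irq_restore(emu_saved_primask);
  return 0;
}

/* "Clear exclusive": abandon the sequence and restore interrupt state. */
static inline void emu_clrex(void) {
  if (emu_armed) {
    emu_armed = 0;
    irq_restore(emu_saved_primask);
  }
}

/* The increment from the question, written against the emulation. */
static inline void do_inc(volatile uint32_t *dat) {
  while (emu_strex(emu_ldrex(dat) + 1, dat)) {}
}
```

Because the saved PRIMASK value is restored rather than interrupts being enabled unconditionally, a sequence started with interrupts already masked leaves them masked. The drawback described above remains: an `emu_ldrex` that is never followed by `emu_strex` or `emu_clrex` leaves interrupts disabled.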

(BTW, side question: I wonder why STREX on the M3 stores the success/failure indication to a register rather than simply setting a flag? Its actual operation requires four extra bits in the opcode, requires that a register be available to hold the success/failure indication, and requires that a "cmp r0,#0" be used to determine if it succeeded. Was it expected that compilers wouldn't be able to handle a STREX intrinsic sensibly if they didn't get the result in a register? Getting Carry into a register takes two short instructions.)

domen · Answer 1 · 2011-06-01T07:18:26.973

5

~~Well... you still have SWP remaining, but it's a less powerful atomic instruction.~~

Interrupt disabling is sure to work though. :-)

Edit:

No SWP on -m0, sorry supercat.

OK, it seems you're only left with interrupt disabling. You can use gcc-compilable inline asm as a guide to how to disable and properly restore interrupts: http://repo.or.cz/w/cbaos.git/blob/HEAD:/arch/arm-cortex-m0/include/lock.h

edited Jun 01 '11 at 07:18

answered Apr 25 '11 at 10:22

domen


Where is SWP documented for the Cortex M0? As for disabling interrupts, is there any nice way within C to save/restore the interrupt-enable flag, so that a ldrex/strex sequence that's performed with interrupts disabled would leave interrupts disabled? – supercat Apr 25 '11 at 15:12
Of course, you can **not** do this in 'C'; but in-line assembler is possible. `asm(" mrs %0, cpsr\n orr %1, %0, #128\n msr cpsr_c, %1\n" : "=r" (old), "=r" (new) : : "memory", "cc");`. You must be in a mode to permit it. If you have the `cpsid`, it can be easier. I don't know too much about the M0. – artless noise Mar 19 '14 at 19:21

Alexandre Pereira Nunes · Answer 2 · 2016-09-29T19:20:41.877

The Cortex-M3 was designed for heavy low-latency, low-jitter multitasking, i.e. its interrupt controller cooperates with the core in order to keep guarantees on the number of cycles from interrupt triggering to interrupt handling. The ldrex/strex pair was implemented as a way to cooperate with all that (by all that I mean interrupt masking and other details such as atomic bit setting via bit-band aliases), as a single-core, non-MMU, non-cache µC would otherwise have little use for it. If it weren't implemented, a low-priority task would have to hold a lock, and that could generate small priority inversions, producing latency and jitter which a hard real-time system (it's designed for this, although the concept is way too broad) can't cope with, at least not within the order of magnitude allowed by the "retry" semantics that a failed ldrex/strex has.

On a side note, and speaking strictly in terms of timings and jitter, the Cortex-M0 has a more traditional interrupt timing profile (i.e. it will not abort instructions on the core when an interrupt arrives), being subject to way more jitter and latency. On this matter (again, strictly timing), it's more comparable to older models (i.e. the arm7tdmi), which also lack atomic load/modify/store as well as atomic increments & decrements and other low-latency cooperative instructions, requiring interrupt disable/enable more often.

I use something like this in Cortex-M3:

#include <stdint.h>

#define unlikely(x) __builtin_expect((long)(x),0)

/* Load-linked: LDREX reads the word and marks it for exclusive access. */
static inline int atomic_LL(volatile void *addr) {
  int dest;

  __asm__ __volatile__("ldrex %0, [%1]" : "=r" (dest) : "r" (addr));
  return dest;
}

static inline int atomic_SC(volatile void *addr, int32_t value) {
  int dest;

  __asm__ __volatile__("strex %0, %2, [%1]" :
          "=&r" (dest) : "r" (addr), "r" (value) : "memory");
  return dest;
}

/**
 * atomic Compare And Swap
 * @param addr Address
 * @param expected Expected value in *addr
 * @param store Value to be stored, if (*addr == expected).
 * @return 0  ok, 1 failure.
 */
static inline int atomic_CAS(volatile void *addr, int32_t expected,
        int32_t store) {
  int ret;

  do {
    if (unlikely(atomic_LL(addr) != expected))
      return 1;
  } while (unlikely((ret = atomic_SC(addr, store))));
  return ret;

}

In other words, it wraps ldrex/strex as the well-known Load-Linked and Store-Conditional primitives, and builds Compare-and-Swap semantics on top of them.

If your code does fine with only compare-and-swap, you can implement it for cortex-m0 like this:

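static inline int atomic_CAS(volatile void *addr, int32_t expected,
        int32_t store) {
  volatile int32_t *p = (volatile int32_t *)addr;
  int ret = 1;

  /* __interrupt_disable()/__interrupt_enable() stand for whatever
   * your toolchain provides; note that interrupts are re-enabled
   * unconditionally on exit. */
  __interrupt_disable();
  if (*p == expected) {
    *p = store;
    ret = 0;
  }
  __interrupt_enable();
  return ret;
}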

That's the most widely used pattern, because some architectures only offer CAS (x86 comes to mind). Emulating the LL/SC pattern on top of CAS seems ugly from where I stand, especially when the SC is more than a few instructions away from the LL. Although such long sequences are very common, ARM doesn't recommend them, especially on the Cortex-M3: since any interrupt makes strex fail, if you take too long between ldrex and strex your code will spend a lot of time looping to retry the strex. That's abusing the pattern, not using it.
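To show how other atomic operations can be layered on a compare-and-swap with that contract (0 = stored, 1 = mismatch), here is a minimal sketch of an atomic add. The `atomic_CAS` below is only a host stand-in with no interrupt protection, and `atomic_add_return` is a hypothetical helper name; on a real target the stand-in would be replaced by the ldrex/strex or interrupt-masking version.

```c
#include <stdint.h>

/* Stand-in (assumption): same contract as the CAS discussed above,
 * but with no interrupt protection, so it is for host testing only. */
static int atomic_CAS(volatile void *addr, int32_t expected, int32_t store) {
  volatile int32_t *p = (volatile int32_t *)addr;

  if (*p != expected)
    return 1;
  *p = store;
  return 0;
}

/* Atomically add delta to *addr and return the new value.
 * If *addr changes between the read and the CAS, the CAS reports a
 * mismatch and the loop retries with the fresh value. */
static int32_t atomic_add_return(volatile int32_t *addr, int32_t delta) {
  int32_t old;

  do {
    old = *addr;
  } while (atomic_CAS(addr, old, old + delta));
  return old + delta;
}
```

The retry loop only ever holds the critical section for the duration of one CAS, so the same source should work on both the M3 and the M0 once the matching `atomic_CAS` is selected at build time.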

As for your side question, in the Cortex-M3 case strex returns its status in a register because the semantics were already defined by higher-level architectures (ldrex/strex exist in multi-core ARMs that were implemented before ARMv7-M was defined, and after it, where the cache controllers actually check ldrex/strex addresses, i.e. strex fails only when the cache can't prove that the data line the address points to wasn't modified).

If I were to speculate, I'd say it was because in the early days this kind of atomic was designed with libraries in mind: you'd return success/failure from functions implemented in assembler, and these would need to respect the ABI, and most ABIs (all I know of) use a register or the stack, and not the flags, to return values. It could also be because compilers are better at register colouring than at working around clobbered flags when some other instruction needs them, i.e. consider a complex operation which generates flags, with an ldrex/strex sequence in the middle of it, where the operation that comes afterwards needs the flags: the compiler would have to move the flags to a register anyway.

I'm surprised that you describe the interrupt model for M0 as closer to ARMv4T than to ARMv7-M. My simplistic view is that ARMv6-M is a sub-set of ARMv7-M (making M0 code binary compatible with M3). — Sean Houlihane, Sep 11 '16 at 12:41
The devil is in the details. The M0 came after the M3, with the marketing promise that you can compile code for it and also run on the M3. That's almost true: generic user-mode library code is compatible; but you need a HAL for making anything else usable, IIRC even interrupt disable/enable isn't binary compatible. — Alexandre Pereira Nunes, Sep 13 '16 at 21:13
Doesn't change the fact that m0 has the same exception model, stacking behavior as m3, not the banked register approach. — Sean Houlihane, Sep 13 '16 at 21:30

score -2 · Answer 3 · answered Mar 14 '14 at 17:12


STREX/LDREX are for multicore processors accessing shared items in memory that is shared across the cores. ARM did an unusually bad job of documenting that; you have to read between the lines in the AMBA/AXI and ARM and TRM docs to figure this out.

How it works is: IF you have a core that supports STREX/LDREX and IF you have a memory controller that supports exclusive access, then IF the memory controller sees the pair of exclusive operations with no other core accessing that memory in between, it returns EXOKAY rather than OKAY. The ARM docs tell the chip designers that if it is a uniprocessor (not implementing the multicore feature) then you don't have to support EXOKAY, just return OKAY, which from a software perspective breaks the LDREX/STREX pair for accesses that hit that logic (the software spins in an infinite loop as it will never return success). The L1 cache does support it, though, so it feels like it works.

For uniprocessor and for cases where you are not accessing memory shared across the cores use SWP.

The -m0 does not support ldrex/strex nor swp, but what are those basically getting you? They are simply getting you an access that is not affected by you doing another access. To prevent you from stomping on yourself, just disable interrupts for the duration, the way we have done atomic accesses since the dark ages. If you want protection between you and a peripheral that can interfere, well, there is no way to get around that, and even a swap may not have helped.

So just disable interrupts around the critical section.
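As an illustration of such a critical section, here is a minimal sketch that updates two related fields together, something a single-word exclusive store could not cover anyway. It assumes GCC-style inline assembly and the ARMv6-M PRIMASK register; `irq_save`, `irq_restore`, `struct byte_queue` and `queue_put` are hypothetical names, and the non-ARM branch is a stand-in for host testing.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_6M__)
/* Cortex-M0: save PRIMASK, then mask interrupts. */
static inline uint32_t irq_save(void) {
  uint32_t primask;
  __asm__ __volatile__("mrs %0, PRIMASK\n\tcpsid i"
                       : "=r"(primask) : : "memory");
  return primask;
}
static inline void irq_restore(uint32_t primask) {
  __asm__ __volatile__("msr PRIMASK, %0" : : "r"(primask) : "memory");
}
#else
/* Host stand-in (assumption): a plain variable models PRIMASK. */
static uint32_t fake_primask;
static inline uint32_t irq_save(void) {
  uint32_t old = fake_primask;
  fake_primask = 1;
  return old;
}
static inline void irq_restore(uint32_t primask) { fake_primask = primask; }
#endif

#define QUEUE_SIZE 8u

struct byte_queue {
  volatile uint32_t head;              /* next slot to write */
  volatile uint32_t count;             /* bytes currently stored */
  volatile uint8_t  data[QUEUE_SIZE];
};

/* Append one byte; returns 0 on success, 1 if the queue is full.
 * head and count are updated inside one critical section so an
 * interrupt handler never observes them out of step. */
static int queue_put(struct byte_queue *q, uint8_t byte) {
  uint32_t state = irq_save();
  int full = (q->count == QUEUE_SIZE);

  if (!full) {
    q->data[q->head] = byte;
    q->head = (q->head + 1u) % QUEUE_SIZE;
    q->count++;
  }
  irq_restore(state);   /* restores the previous state, so nesting is safe */
  return full;
}
```

Restoring the saved state instead of enabling interrupts unconditionally means `queue_put` can also be called from code that already runs with interrupts masked.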

answered Mar 14 '14 at 17:12

old_timer


Since writing the question, and having to deal with the M0, I've taken to saving interrupt state, disabling interrupts, doing the desired action, and restoring interrupt state, but the ldrex/strex works fine on the M3; it basically says "perform the store if no interrupt has happened since the load". Unless there are so many interrupts that the ldrex/strex loop would fail repeatedly, I think it's cheaper to say: `retry: ldrex r1,[r0] / add r1,r1,#1 / strex r2,r1,[r0] / cmp r2,#0 / loop retry` than... – supercat Mar 14 '14 at 17:17
to do `mrs r2,PRIMASK / cpsid i / ldr r1,[r0] / add r1,r1,#1 / str r1,[r0] / msr PRIMASK,r2`, though I'll admit I'm not really sure how the timings work out. When I'd originally written the question, I thought it was necessary to handle interrupt state using the much more complicated approach employed by the EFM32 libraries; the ldrex/strex is a massive improvement over using the EFM32 routines. – supercat Mar 14 '14 at 17:25
Geez, I didn't even look at the date, I thought this was a new question, sorry... It still is very strange to me that the M3 (ARMv7-M) came out first and the M0 (ARMv6-M) came out later. I don't know if ARM just offered them in that order or if finding a chip vendor to use them is why one came out before the other; it sure did confuse a lot of folks that jumped on the Thumb extensions only to find out that ARMv6 only had a few and ARMv7 had a hundred fifty or so. – old_timer Mar 14 '14 at 17:41
The Cortex-M0 is supposed to be cheaper and lower power than the M3; I sort of liked the ARM7-TDMI, though, which is what I cut my teeth on. The M3 is almost as good as the ARM7-TDMI's 32-bit instruction mode, but the 32-bit instruction mode could do a few things the M3 can't (e.g. I think I did something like `ldrh r0,[r10],#2 / ldrb r1,[pc, r0 lsr #12] / add pc,pc,r0 asl #2` to jump to one of 16 routines with a 1kbyte area based upon the top four bits of a fetched 16-bit word). The shift-right in `ldrb` is unusual, but was handy; the Thumb2 instruction set lacks that option, though. – supercat Mar 14 '14 at 18:19
BTW, if I remember right, the sequence exploited the fact that during the execution of the "ldrb" instruction, PC read as the current location plus 4, so the jump table started immediately after the `add pc` instruction. This was in code which needed to observe a value on another processor's address bus and reply with data within the cycle time of that other processor, so I didn't want to waste any cycles needlessly. – supercat Mar 14 '14 at 18:23

I agree, I assume the m3 is a replacement for the arm7 based microcontrollers, the -m0 is there to push the edges of power and size. just opinion no proof. – old_timer Mar 14 '14 at 18:36
Could you look at http://meta.stackexchange.com/questions/185837/could-some-of-the-arm-tags-be-merged/ and give some comments on cortex-m3, cortex-m0, etc. I proposed we should just get rid of them and only use cortex-m; but if you have a good argument otherwise? – artless noise Mar 17 '14 at 04:11
Note that there are multicore microcontrollers with Cortex-M0, e.g. LPC43xx or LPC541xx with one Cortex-M4 and one or two Cortex-M0. An option to have LDREX and friends for such coprocessors would be very useful, it will be hard to replace them by something using only ordinary memory accesses. – starblue Mar 16 '15 at 12:59
It would be, but ARM didn't necessarily design for that; that is a vendor-specific thing, for which they could implement something, but it wouldn't be a new instruction... – old_timer Mar 16 '15 at 15:10

Sorry, this is plain wrong: *STREX/LDREX are for multicore processors* - just read the M3 and M4 reference manual. `LDREX` and `STREX` are perfectly fine for implementing non-blocking atomic operations on the M3 and M4. – Venemo Oct 29 '15 at 08:10
@Venemo Agreed. STREX/LDREX are far more efficient than SWP even on a single core, too (which is why SWP is not part of the Thumb-2 instruction set). – cooperised Aug 02 '18 at 09:46
