
I'm studying the difference between mutex and atomic in C++11.

To my understanding, a mutex is a locking mechanism implemented on top of the OS/kernel. For example, Linux offers futex, with which we can implement mutexes and semaphores. Furthermore, I know that futex itself is implemented with low-level atomic operations such as compare-and-swap.

As for std::atomic, I know that it is implemented in terms of the memory model introduced by C++11. However, I don't know how the memory model is implemented at the low level. If it is also implemented with atomic operations like compare-and-swap, what is the difference between std::atomic and a mutex?

In a word, if std::atomic::is_lock_free returns false, I'd say that std::atomic is the same as a mutex. But if it returns true, how is it implemented at the low level?

Yves
  • [Can num++ be atomic for 'int num'?](//stackoverflow.com/q/39393850) explains how x86 asm / hardware implements atomic operations, specifically RMW operations, with links to more stuff. Atomic operations are the building blocks that a mutex can be built from. Also [Atomicity on x86](//stackoverflow.com/a/38465341) and [Why is integer assignment on a naturally aligned variable atomic on x86?](//stackoverflow.com/q/36624881). – Peter Cordes Dec 04 '19 at 12:59
  • Some of those touch on other ISAs where the hardware primitive is [LL/SC](https://en.wikipedia.org/wiki/Load-link/store-conditional) instead of directly supporting atomic add/sub/CAS, but mostly they're about x86. There are probably some other existing SO answers that cover other ISAs more generally. – Peter Cordes Dec 04 '19 at 13:03
  • *is implemented based on the memory model, which is introduced by C++11.* - not exactly. Compilers have to implement the C++11 memory model semantics on top of *hardware* memory models, which have existed since way before C++11; OSes and hand-written asm mutexes for SMP systems need to know the underlying memory model. They differ on a per-ISA basis. See https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html – Peter Cordes Dec 04 '19 at 13:06
  • Related: https://preshing.com/20120930/weak-vs-strong-memory-models/ talks about some different HW memory models, where you need more or less barriers to implement acq_rel. Jeff Preshing's other articles are very good reading, too. BTW, this question may be too broad; I could just post asm for how some C++ statements compile on various ISAs (or you could look yourself on https://godbolt.org/ with optimization enabled), but without explaining the HW memory models for those ISAs IDK if it would answer your question. Really understanding lockless code is book-level amounts of info. – Peter Cordes Dec 04 '19 at 13:12
  • Have a read of https://stackoverflow.com/questions/6319146/c11-introduced-a-standardized-memory-model-what-does-it-mean-and-how-is-it-g (Is almost a duplicate but includes a lot of extra information). – Richard Critten Dec 04 '19 at 13:23
  • @PeterCordes Thanks a lot for your explanation, well, if `std::atomic` is implemented differently on different architectures, I think only one example based on some specific architecture is enough for me, for example, X86. – Yves Dec 05 '19 at 01:05

2 Answers


If atomic operations are lock_free, they're probably implemented the same way the components of a mutex are implemented. After all, to lock a mutex, you really want some kind of atomic operation to ensure that one and only one thread locks the mutex.

The difference is that lock-free atomic operations don't have a "locked" state. Let's compare two possible ways to do an atomic increment of a variable:

First, the mutex way. We lock a mutex, we read-increment-write the variable, then we unlock the mutex. If the thread gets interrupted during the read-increment-write, other threads that attempt to perform this same operation will block trying to lock the mutex. (See Where is the lock for a std::atomic? for how this works on some real implementations, for objects too large to be lock_free.)

Second, the atomic way. The CPU "locks" just the cache line holding the variable we want to modify for the duration of a single read-increment-write instruction. (This means the CPU delays responding to MESI requests to invalidate or share the cache line, keeping exclusive access so no other CPU can look at it. MESI cache coherency always requires exclusive ownership of a cache line before a core can modify it so this is cheap if we already owned the line). It is not possible for us to get interrupted during an instruction. Another thread that attempts to access this variable, at worst, has to wait for the cache coherency hardware to sort out who can modify the memory location.

So how do we lock a mutex? Likely we perform an atomic compare-and-swap. So light atomic operations are the primitives from which heavy mutex operations are assembled.

Of course this is all platform-specific. But this is what typical, modern platforms that you are likely to use do.

David Schwartz
  • I added some clarifications to this, specifically about "cache locks" which use the same word as software mutex/spinlock locking but are *very* different. Related: [Can num++ be atomic for 'int num'?](//stackoverflow.com/q/39393850). Also to broaden this: note that [LL/SC machines](https://en.wikipedia.org/wiki/Load-link/store-conditional) (almost all the major non-x86 ISAs) use a "store conditional" that *checks* if we maintained exclusive ownership, instead of taking a cache lock at the start. Reducing worst-case latency for other cores at the cost of losing throughput to retries. – Peter Cordes Dec 05 '19 at 05:25
  • Also, this doesn't mention anything about how `std::atomic` provides sequential-consistency (or optionally weaker ordering); that's an important part of what it does for pure-load and pure-store. (And RMWs on non-x86) – Peter Cordes Dec 05 '19 at 05:27
  • (But that's partly the question's fault for being too broad, I think.) – Peter Cordes Dec 05 '19 at 05:33

what is the difference between std::atomic and mutex

A mutex is a concurrency construct, independent of any user data, offering lock and unlock methods that let you protect (enforce mutual exclusion within) a region of code. You can put whatever you want in that region.

std::atomic<T> is an adapter over a single instance of a type T, allowing atomic access to that object on a per-operation basis.

A mutex is more general in the sense that one possible implementation of std::atomic is to protect all access to the underlying object with a mutex.

std::atomic exists mostly because of the other common implementation: using an atomic instruction² to execute the operation directly, without requiring a mutex. This is the implementation used when std::atomic<T>::is_lock_free() returns true. It is generally more efficient than the mutex approach, but only applicable to objects small enough to be manipulated "in one shot" by atomic instructions.


² In some cases, the compiler is able to use plain instructions (rather than special concurrency-related instructions), such as normal loads and stores, if they offer the required guarantees on the platform in question.

For example, on x86, compilers implement all std::atomic loads, for small-enough values with plain loads, and implement all stores weaker than memory_order_seq_cst with plain stores. seq_cst stores are implemented with special instructions, however - a trailing mfence after mov on GCC before 10.1, and (implicit lock) xchg mem,reg on clang, recent GCC, and other compilers.

Note also that the asymmetry between loads and stores is a compiler choice: they could have put the special handling on seq_cst loads instead, but because loads generally outnumber stores, that would be slower in most cases. (And because cheap loads in fast paths are more valuable.)
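At the source level the difference is only the memory_order argument; the comments below record the typical x86-64 code generation described above (easy to check on https://godbolt.org/), which is common-compiler behavior, not a standard guarantee:

```cpp
#include <atomic>

std::atomic<int> x{0};

void store_release(int v) {
    // Typically compiles to a plain `mov` store on x86-64: x86 stores
    // already have release semantics at the hardware level.
    x.store(v, std::memory_order_release);
}

void store_seq_cst(int v) {
    // Typically needs a full barrier on x86-64: `xchg mem,reg` (or
    // `mov` + `mfence` on older GCC) so the store can't be reordered
    // with later loads.
    x.store(v, std::memory_order_seq_cst);
}
```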

BeeOnRope
  • I think your footnote is confusing at best. If an operation isn't atomic from the POV of other threads, you can't use it to implement lock-free `std::atomic`. e.g. ARM can use `ldp` to load an aligned pair of registers as the implementation for `atomic::load(memory_order_relaxed)` *because* of documented guarantees that that's atomic for aligned loads on some microarchitectures. It's not a special instruction that exists *only* for its atomicity, if that's what you meant. Or if you mean x86 `add [mem], reg` without a `lock` prefix on a uniprocessor, that's atomic wrt. context switch – Peter Cordes Dec 05 '19 at 02:56
  • @PeterCordes - maybe "non-atomic" isn't the best terminology. I just mean "plain" operations like say loads and stores which sometimes implement sufficient atomicity to implement some `std::atomic` methods (at least for some memory orders) without being a special, slow atomic operation at the ISA level. – BeeOnRope Dec 05 '19 at 19:03