
Given the following sample that intends to wait until another thread stores 42 in a shared variable shared without locks and without waiting for thread termination, why would volatile T or std::atomic<T> be required or recommended to guarantee concurrency correctness?

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

int main()
{
  int64_t shared = 0;
  std::thread thread([&shared]() {
    shared = 42;
  });
  while (shared != 42) {
  }
  assert(shared == 42);
  thread.join();
  return 0;
}

With GCC 4.8.5 and default options, the sample works as expected.

  • Why I voted to reopen: IMHO, this question and answer add value because they discuss the topic based on a specific example rather than broadly without specifics, which would make it harder to answer and would require answers to be more general than in this case. Also, because there is a sample, the answer can provide evidence to support claims about how volatile and std::atomic are supposed to work. – horstr Mar 24 '21 at 22:02
  • Agreed. Specific should trump general except when the specific essentially repeats general. The former dupe should live on as a good reference, though. [Concurrency: Atomic and volatile in C++11 memory model](https://stackoverflow.com/questions/8819095/concurrency-atomic-and-volatile-in-c11-memory-model) – user4581301 Mar 24 '21 at 22:13
  • GCC's default options don't include optimization, so [*every* variable is treated somewhat like `volatile`](https://stackoverflow.com/questions/53366394/why-does-clang-produce-inefficient-asm-with-o0-for-this-simple-floating-point). That part is basically a duplicate of [Multithreading program stuck in optimized mode but runs normally in -O0](https://stackoverflow.com/q/58516052), and also [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) – Peter Cordes Mar 24 '21 at 22:16
  • Only the `atomic` version is correct; the others are UB due to data-race accesses (even with `volatile`). – Jarod42 Mar 25 '21 at 12:11
  • I don't think the top paragraph of the question is a good place for self-justification. If anything, a comment (which already exists). Even a note at the bottom of the question would be too much, IMO, but maybe. Especially if you keep it much shorter, e.g. "this question is mostly a placeholder and example for my answer". (Where "my answer" can be a link to the help page explaining that self-answers are good, if you feel you need that. But I think at this point with the question having a positive score and close votes currently pending, you don't need further justification.) – Peter Cordes Mar 30 '21 at 01:01

1 Answer


The test seems to indicate that the sample is correct, but it is not. Similar code could easily end up in production and might even run flawlessly for years.

We can start off by compiling the sample with -O3. Now, the sample hangs indefinitely. (The default is -O0: no optimization, geared towards debug-consistency, which treats every variable somewhat like volatile - that is the reason the test didn't reveal the code as unsafe.)

To get to the root cause, we have to inspect the generated assembly. First, the x86_64 assembly generated by GCC 4.8.5 with -O0, corresponding to the un-optimized, working binary:

        // Thread B:
        // shared = 42;
        movq    -8(%rbp), %rax
        movq    (%rax), %rax
        movq    $42, (%rax)

        // Thread A:
        // while (shared != 42) {
        // }
.L11:
        movq    -32(%rbp), %rax     # Check shared every iteration
        cmpq    $42, %rax
        jne     .L11

Thread B executes a simple store of the value 42 in shared. Thread A reads shared for each loop iteration until the comparison indicates equality.

Now, we compare that to the -O3 outcome:

        // Thread B:
        // shared = 42;
        movq    8(%rdi), %rax
        movq    $42, (%rax)

        // Thread A:
        // while (shared != 42) {
        // }
        cmpq    $42, (%rsp)         # check shared once
        je      .L87                # and skip the infinite loop or not
.L88:
        jmp     .L88                # infinite loop
.L87:

Optimizations associated with -O3 replaced the loop with a single comparison and, if not equal, an infinite loop to match the expected behavior. With GCC 10.2, the loop is optimized out entirely. (Unlike in C, infinite loops with no side effects or volatile accesses are undefined behaviour in C++.)

The problem is that the compiler and its optimizer are not aware of the code's concurrency implications. Consequently, the optimizer concludes that shared cannot change in thread A - the loop is equivalent to dead code. (Or, to put it another way, data races are UB, and the optimizer is allowed to assume that the program never encounters UB. If you're reading a non-atomic variable, that must mean nobody else is writing it. This is what allows compilers to hoist loads out of loops, and similarly sink stores, which are very valuable optimizations for the normal case of non-shared variables.)
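
To illustrate, here is a sketch of the transformation the optimizer is effectively allowed to perform on the non-atomic loop (conceptual source with a hypothetical function name, not what GCC literally emits):

#include <cstdint>

// Conceptual view of the optimized non-atomic loop: because no other thread
// may legally modify `shared` (that would be a data race, i.e. UB), the load
// can be hoisted out of the loop.
void wait_for_42(int64_t& shared)
{
  int64_t tmp = shared;   // single load, hoisted out of the loop
  while (tmp != 42) {     // the condition never re-reads memory
  }                       // -> either skipped entirely or spins forever
}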

The solution requires us to communicate to the compiler that shared is involved in inter-thread communication. One way to accomplish that may be volatile. While the actual meaning of volatile varies across compilers and guarantees, if any, are compiler-specific, the general consensus is that volatile prevents the compiler from caching volatile accesses in registers or optimizing them away. This is essential for low-level code that interacts with hardware, and it has its place in concurrent programming, albeit with a downward trend due to the introduction of std::atomic.
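
Applied to the sample, the relevant change is the declaration of shared; the rest stays as in the question:

#include <cassert>
#include <cstdint>
#include <thread>

int main()
{
  volatile int64_t shared = 0;  // volatile: accesses must not be optimized away
  std::thread thread([&shared]() {
    shared = 42;
  });
  while (shared != 42) {
  }
  assert(shared == 42);
  thread.join();
  return 0;
}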

With volatile int64_t shared, the generated instructions change as follows:

        // Thread B:
        // shared = 42;
        movq    24(%rdi), %rax
        movq    $42, (%rax)

        // Thread A:
        // while (shared != 42) {
        // }
.L87:
        movq    8(%rsp), %rax
        cmpq    $42, %rax
        jne     .L87

The loop cannot be eliminated anymore, as it must be assumed that shared may have changed even though there is no evidence of that in the code itself. As a result, the sample now works with -O3.

If volatile fixes the issue, why would you ever need std::atomic? Two aspects relevant for lock-free code make std::atomic essential: atomicity of memory operations and memory order.

To build the case for load/store atomicity, we review the assembly generated by GCC 4.8.5 with -O3 -m32 (the 32-bit version) for volatile int64_t shared:

        // Thread B:
        // shared = 42;
        movl    4(%esp), %eax
        movl    12(%eax), %eax
        movl    $42, (%eax)
        movl    $0, 4(%eax)

        // Thread A:
        // while (shared != 42) {
        // }
.L88:                               # do {
        movl    40(%esp), %eax
        movl    44(%esp), %edx
        xorl    $42, %eax
        movl    %eax, %ecx
        orl     %edx, %ecx
        jne     .L88                # } while(shared ^ 42 != 0);

For 32-bit x86 code generation, 64-bit loads and stores are usually split into two instructions. For single-threaded code, this is not an issue. For multi-threaded code, it means that another thread can observe a partial result of the 64-bit memory operation, leaving room for unexpected inconsistencies that might not cause problems 100 percent of the time, but can occur at random, with a probability heavily influenced by the surrounding code and software usage patterns. Even if GCC chose to generate instructions that guarantee atomicity by default, that still wouldn't affect other compilers and might not hold true for all supported platforms.
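
To make the tearing hazard concrete, here is a hypothetical illustration with made-up values (in the sample itself, the transition from 0 to 42 happens to keep the high half at 0, so a torn read would go unnoticed):

#include <cstdint>
#include <cstdio>

// Hypothetical torn 64-bit read on a 32-bit target: the reader combines the
// new low half with the old high half because the store was split in two.
int main()
{
  const uint64_t old_value = 0x0000000100000000ull;   // value before the store
  const uint64_t new_value = 42ull;                    // value being stored
  const uint64_t torn = (old_value & 0xFFFFFFFF00000000ull)
                      | (new_value & 0x00000000FFFFFFFFull);
  std::printf("%#llx\n", (unsigned long long)torn);    // neither old nor new
  return 0;
}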

To guard against partial loads/stores in all circumstances and across all compilers and supported platforms, std::atomic can be employed. Let's review how std::atomic affects the generated assembly. The updated sample:

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

int main()
{
  std::atomic<int64_t> shared{0};  // initialize: the default constructor leaves the value indeterminate before C++20
  std::thread thread([&shared]() {
    shared.store(42, std::memory_order_relaxed);
  });
  while (shared.load(std::memory_order_relaxed) != 42) {
  }
  assert(shared.load(std::memory_order_relaxed) == 42);
  thread.join();
  return 0;
}

The generated 32-bit assembly based on GCC 10.2 (-O3: https://godbolt.org/z/8sPs55nzT):

        // Thread B:
        // shared.store(42, std::memory_order_relaxed);
        movl    $42, %ecx
        xorl    %ebx, %ebx
        subl    $8, %esp
        movl    16(%esp), %eax
        movl    4(%eax), %eax       # function arg: pointer to  shared
        movl    %ecx, (%esp)
        movl    %ebx, 4(%esp)
        movq    (%esp), %xmm0       # 8-byte reload
        movq    %xmm0, (%eax)       # 8-byte store to  shared
        addl    $8, %esp

        // Thread A:
        // while (shared.load(std::memory_order_relaxed) != 42) {
        // }
.L9:                                # do {
        movq    -16(%ebp), %xmm1       # 8-byte load from shared
        movq    %xmm1, -32(%ebp)       # copy to a dummy temporary
        movl    -32(%ebp), %edx
        movl    -28(%ebp), %ecx        # and scalar reload
        movl    %edx, %eax
        movl    %ecx, %edx
        xorl    $42, %eax
        orl     %eax, %edx
        jne     .L9                 # } while(shared.load() ^ 42 != 0);

To guarantee atomicity for loads and stores, the compiler emits an 8-byte SSE2 movq instruction (to/from the bottom half of a 128-bit SSE register). Additionally, the assembly shows that the loop remains intact even though volatile was removed.

By using std::atomic in the sample, it is guaranteed that

  • std::atomic loads and stores are not subject to register-based caching
  • std::atomic loads and stores do not allow partial values to be observed

The C++ standard doesn't talk about registers at all, but it does say:

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

While that leaves room for interpretation, caching std::atomic loads across many iterations, as effectively happened in our sample (without volatile or atomic), would clearly be a violation - the store might never become visible. Current compilers don't even optimize atomics within one block, e.g. two accesses in the same iteration.

On x86, naturally-aligned loads/stores (where the address is a multiple of the load/store size) are atomic up to 8 bytes without special instructions. That's why GCC is able to use movq.

atomic<T> with a large T may not be supported directly by hardware, in which case the compiler can fall back to using a mutex.
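
Whether a given specialization is lock-free can be checked in code; a small sketch (is_always_lock_free requires C++17):

#include <atomic>
#include <cstdint>
#include <cstdio>

int main()
{
  // Compile-time property of the specialization (C++17):
  std::printf("always lock-free: %d\n",
              static_cast<int>(std::atomic<int64_t>::is_always_lock_free));

  // Run-time property of a particular object (may depend on alignment):
  std::atomic<int64_t> shared{0};
  std::printf("this object lock-free: %d\n",
              static_cast<int>(shared.is_lock_free()));
  return 0;
}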

A large T (e.g. the size of two registers) on some platforms might require an atomic RMW operation (if the compiler doesn't simply fall back to locking); such operations are sometimes available at a larger size than the largest pure load / pure store that is guaranteed atomic (e.g. lock cmpxchg16b on x86-64, or an ldrexd/strexd retry loop on ARM). Single-instruction atomic RMWs (like x86 uses) internally involve a cache-line lock or a bus lock. For example, older versions of clang -m32 for x86 will use lock cmpxchg8b instead of movq for an 8-byte pure load or pure store.

What's the second aspect mentioned above, and what does std::memory_order_relaxed mean? Both the compiler and the CPU can reorder memory operations to optimize efficiency. The primary constraint on reordering is that all loads and stores must appear to have been executed in program order from the perspective of the thread executing them; other threads may observe a different order. Therefore, in case of inter-thread communication, the memory order must be taken into account to establish the required ordering despite such reordering. The required memory order can be specified for std::atomic loads and stores. std::memory_order_relaxed does not impose any particular order.
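
For the single flag in the sample, relaxed is sufficient. As soon as the flag is meant to publish other data, a stronger order is required; a minimal sketch (not part of the original sample) using release/acquire:

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

int main()
{
  int64_t data = 0;                 // non-atomic payload
  std::atomic<bool> ready{false};   // publication flag
  std::thread producer([&] {
    data = 42;                                      // 1: write the payload
    ready.store(true, std::memory_order_release);   // 2: publish it
  });
  while (!ready.load(std::memory_order_acquire)) {  // 3: wait for publication
  }
  assert(data == 42);  // 4: the acquire load synchronizes with the release store
  producer.join();
  return 0;
}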

Mutual exclusion primitives enforce a specific memory order (acquire-release order) so that memory operations stay in the lock scope and stores executed by previous lock owners are guaranteed to be visible to subsequent lock owners. Thus, using locks, all the aspects raised here are addressed simply by using the locking facility. As soon as you break out of the comfort locks provide, you have to be mindful of the consequences and the factors that affect concurrency correctness.
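
For completeness, a sketch of what a lock-based variant of the sample could look like, using std::condition_variable so the waiting thread blocks instead of spinning (an illustration, not a drop-in replacement):

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

int main()
{
  int64_t shared = 0;
  std::mutex m;
  std::condition_variable cv;
  std::thread thread([&] {
    {
      std::lock_guard<std::mutex> lock(m);  // stores inside the lock scope...
      shared = 42;
    }
    cv.notify_one();                        // ...are visible to the next owner
  });
  std::unique_lock<std::mutex> lock(m);
  cv.wait(lock, [&] { return shared == 42; });  // predicate checked under the lock
  thread.join();
  return 0;
}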

Being as explicit as possible about inter-thread communication is a good starting point so that the compiler is aware of the load/store context and can generate code accordingly. Whenever possible, prefer std::atomic<T> with std::memory_order_relaxed (unless the scenario calls for a specific memory order) to volatile T (and, of course, T). Also, whenever possible, prefer not to roll your own lock-free code to reduce code complexity and maximize the probability of correctness.

horstr
  • *The test indicates that the sample is correct* - I'd suggest phrasing as "the test doesn't detect any problems". As you say, that doesn't mean there aren't any, but lockless code is notoriously hard to test, especially if you only have an x86 (where runtime reordering is limited, so only compile-time reordering can break some things, unlike on ARM or POWER). – Peter Cordes Mar 24 '21 at 22:22
  • Re: compilers keeping atomic values in registers: the ISO C++ standard's "as-if" rule does apply to atomics, with the language in the standard only talking about possible values that a program might or must observe under certain conditions. But it turns out to be more problematic than originally realized, so current compilers treat `atomic` essentially like `volatile atomic`: [Why don't compilers merge redundant std::atomic writes?](https://stackoverflow.com/q/45960387) links WG21/P0062R1 and N4455. – Peter Cordes Mar 24 '21 at 22:34
  • Thanks, @PeterCordes, I changed the sentence slightly to emphasise that the sample is indeed incorrect. – horstr Mar 24 '21 at 22:38
  • Re: when to use `volatile` (and why exactly cache coherency makes it even work at all in practice, despite not avoiding data-race UB in ISO C++): [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) - pretty much never. – Peter Cordes Mar 24 '21 at 22:39
  • My point about testing was that *that's the wrong way to think about a test*. A test that doesn't find a problem doesn't show that code is correct. This is fundamental in a language that has undefined behaviour, and especially for multi-threaded code when different test platforms can only exercise some kinds of reordering or stalls / thread sleeping / ordering. Especially if you aren't using a UB-sanitizer and/or race-detection simulator. (If you *are* using those things, you can be a lot more sure about successful test => no problem.) – Peter Cordes Mar 24 '21 at 22:41
  • What platforms did you have in mind where a "cache-lock or bus-lock" is necessary? The phrasing seems to imply that software could explicitly take a cache-lock, but I'm pretty sure that's not what you meant because that's not how it works. (Something like x86 `lock cmpxchg8b [mem]` (or 16b in long mode) just asks the HW to do it atomically, and yes that will internally involve a cache-lock and unlock if it's not split across cache lines). So I find it strange that you bring all that up but never mention the fact that atomic RMWs are the main use-case for HW cache locking. – Peter Cordes Mar 24 '21 at 22:48
  • Yes, @PeterCordes, thanks again. I know about that question and the associated discussion. The sentence in the standard saying that "implementations should make atomic stores visible to atomic loads within a reasonable amount of time" indicates to me that the as-if rule cannot apply in this case, because register caching would violate the notion of "making stores visible within a reasonable amount of time". You're definitely right in terms of redundant stores or the famous "progress bar" discussion (deferring the actual store until 100 % was reached), though. – horstr Mar 24 '21 at 22:49
  • I fully agree. A test has little value in terms of determining concurrency correctness. – horstr Mar 24 '21 at 22:51
  • Keeping a val in a register *indefinitely* is obviously ruled out by that, but there are lots of things compilers can do. e.g. an atomic store, and then reload of the same value in the same thread, could be assumed to always get the just-stored value. That's one of the possible values allowed by the standard, so the compiler could decide (at compile time) that's what always happens. Or more simply, for a couple loads of the same atomic var within the same loop iteration, with no other loads in between, it's legal for the compiler to decide they always get the same value. Or back 2 back stores – Peter Cordes Mar 24 '21 at 22:53
  • @PeterCordes The point I intended to raise is that the implementation of std::atomic might have to resort to locking a cache line by using atomic instructions to perform the load/store. Of course, that's what the CPU handles internally and nothing that could be controlled using the instruction set - at least there's no architecture that comes to mind that would provide that level of control. But yes, it does not sound like a consequence of the atomic instruction in the original text. Maybe I can reword that to clear this up. Thanks. – horstr Mar 24 '21 at 22:55
  • ... cache locks and so on. IIRC, some ARM versions can only do an 8-byte atomic store when it's part of an LL/SC. (LDREXD/STREXD). Is that what you meant? That does have to commit the whole store to cache at once, even if that takes 2 cycles. But making the whole RMW atomic does *not* take a cache lock. The store simply fails if the whole RMW could have appeared non-atomic to another observer, by monitoring that cache line to see if we lost ownership of it, for example. Other ARMs can do 8-byte atomic load/store like x86, as a pure load or store (but with a pair of GP regs). – Peter Cordes Mar 24 '21 at 22:58
  • @PeterCordes AFAICR, atomic instructions implied a bus lock before the P6 family of x86 processors. Starting with P6, the cache line is locked instead. – horstr Mar 24 '21 at 22:59
  • Yeah, I thought you were talking about `lock`ed instructions that do an atomic RMW. Aligned `mov` is an atomic instruction ([Why is integer assignment on a naturally aligned variable atomic on x86?](//stackoverflow.com/q/36624881)). If we're just talking about x86, [Intel's documentation](//www.felixcloutier.com/x86/lock) *still* just says it asserts a `LOCK#` signal, *because the internals only matter for perf*. (e.g. scalability of multiple cores each doing an atomic RMW on a different cache line, and latency of not going off-core). Cache-line-split `lock`s are extremely slow, bus lock. – Peter Cordes Mar 24 '21 at 23:06
  • Summing up, I did not mean to imply that a cache line lock is necessary, but might be required to guarantee atomicity depending on sizeof(T) and the target platform. The main point was: The implementation might cause an ordinary load/store to be generated (one end) or an actual high-level mutual exclusion primitive might be required (the other end of the spectrum). – horstr Mar 24 '21 at 23:07
  • Yeah, I thought so, but I found the phrasing clunky, and worried it could be confusing to readers who didn't already understand the CPU-architecture details, and how compilers implemented pure-load / pure-store for `std::atomic` for sizes small enough to be "always_lock_free". – Peter Cordes Mar 24 '21 at 23:10
  • Overall good answer, although perhaps redundant with the existing answers on [Multithreading program stuck in optimized mode but runs normally in -O0](https://stackoverflow.com/q/58516052). Perhaps you'd want to post it there, although that question uses a different example, and you intentionally chose `int64_t` so you could use a 32-bit ISA to make a point about atomicity. Anyway, I'll edit and try to improve the phrasing of a couple things I brought up, and link other existing Q&As to support some relevant points you make. – Peter Cordes Mar 24 '21 at 23:14
  • Maybe that's the misunderstanding and my mistake in not being more verbose. When I say atomic instruction with respect to x86, I refer to using the `LOCK` prefix. You're right that an aligned `mov` on x86 is atomic, but I wouldn't consider it to be an "atomic instruction" in the sense of a category of instructions that involve either a cache or bus lock and come with significant overhead. – horstr Mar 24 '21 at 23:18
  • So yes. what my text is missing here is the differentiation between atomic instructions as in an aligned mov (pure load/store) and atomic instructions as in read-modify-write instructions that require a cache line or bus lock - at least in case of x86. – horstr Mar 24 '21 at 23:31
  • I think in the context of answering a question about `std::atomic::load`, it's best to use `lock`ed instruction when that's what you mean. "atomic instruction" is way more ambiguous. I assumed you were using it as a clumsy way to describe atomic *RMW* instructions, but readers that don't already understand this might not. If you want to include stuff like LL/SC retry loops for non-x86 ISAs, "atomic RMW instructions" would still include that, but "`lock`ed instructions" wouldn't. Your phrasing still strangely excludes LL/SC instructions because they *don't* take a cache or bus lock. :/ – Peter Cordes Mar 24 '21 at 23:32
  • Ok, I made a fairly substantial edit. You might want to tighten things back up and / or put them into your own words, if you liked your phrasing or formatting better for any of the parts I changed. – Peter Cordes Mar 25 '21 at 00:03
  • @PeterCordes Thanks for the edit. The added references and the clarification in respect to the atomicity implementation definitely add value to the answer! Considering the minimal complexity of the assembly, the added comments might not have been necessary, but now it's definitely easier for the reader to follow the train of thought. Thanks again for taking the time to review and edit the answer! – horstr Mar 25 '21 at 00:25