7

https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html#Atomic-Types says: "In practice, you can assume that int is atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of."

My question is whether pointer assignment can be considered atomic on the x86_64 architecture for a C program compiled with gcc's -m64 flag. The OS is 64-bit Linux and the CPU is an Intel(R) Xeon(R) CPU D-1548. One thread will be setting a pointer and another thread will be accessing it. There is only one writer thread and one reader thread. The reader should get either the previous value of the pointer or the latest value, with no garbage value in between.

If it is not considered atomic, please let me know how I can use the gcc atomic builtins, or maybe a memory barrier like __sync_synchronize, to achieve the same thing without using locks. I'm interested only in a C solution, not C++. Thanks!

user138645
  • First, there are almost certainly ways to make modifications to a pointer non-atomic even on x86 hardware - `packed` jumps out immediately. Second, if you're wrong, you've just introduced a wonderful little [Heisenbug](https://en.wikipedia.org/wiki/Heisenbug) that you will **never** find. In short, **never** "assume". – Andrew Henle Aug 03 '20 at 16:34
  • The link seems to refer to atomicity between a signal handler and a baseline thread [on the _same_ CPU] and _not_ between threads on multiple cores. You'll need some primitives from `stdatomic.h`. But, even with atomic access, how does the reader thread _know_ when the writer thread has posted a new value? The reader could loop on _stale_ values. You need some other sync mechanism (e.g. `sem_wait`/`sem_post`), a sequence number, or similar. – Craig Estey Aug 03 '20 at 16:36

4 Answers

6

Bear in mind that atomicity alone is not enough for communicating between threads. Nothing prevents the compiler and CPU from reordering previous/subsequent load and store instructions with that "atomic" store. In the old days people used `volatile` to prevent that reordering, but it was never intended for use with threads and doesn't provide a way to specify less or more restrictive memory ordering (see "Relationship with volatile" there).

You should use C11 atomics because they guarantee both atomicity and memory order.
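For example, here is a minimal sketch of what that looks like for a single writer publishing a pointer (the names and the data_t struct are illustrative, not from the question):

#include <stdatomic.h>

typedef struct { int payload; } data_t;

_Atomic(data_t *) shared_ptr;   // atomic pointer, zero-initialized to NULL

// Writer: fill in the object, then publish the pointer.
// memory_order_release guarantees the payload stores become visible
// to any thread that sees the new pointer value.
void publish(data_t *d) {
    d->payload = 42;
    atomic_store_explicit(&shared_ptr, d, memory_order_release);
}

// Reader: sees either NULL/the old pointer or the fully initialized new
// one, never a torn value; memory_order_acquire pairs with the release.
data_t *get_latest(void) {
    return atomic_load_explicit(&shared_ptr, memory_order_acquire);
}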

Maxim Egorushkin
  • Usually people that want to avoid `#include <stdatomic.h>` and `_Atomic` want that because they think it's less efficient. `memory_order_relaxed` usually compiles to the same asm as you could get with `volatile` and/or `asm("" ::: "memory")` barriers to make things safe (non-portably, on a specific implementation) without `_Atomic`. Related: [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) - as you say, never. – Peter Cordes Aug 03 '20 at 16:53
3

For almost all architectures, pointer loads and stores are atomic. A once-notable exception was the 8086/80286, where a pointer could be a seg:offset pair; there was an l[des]s instruction which could perform an atomic load, but no corresponding atomic store.

The integrity of the pointer itself is only a small concern; your bigger issue revolves around synchronization: the pointer was at value Y, and you set it to X; how will you know when nobody is using the (old) Y value? A somewhat related problem is that you may have stored things at X which the other thread expects to find. Without synchronization, the other thread might see the new pointer value while what it points to is not yet up to date. A sketch of one way to handle both problems follows.
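One way to answer the "when is nobody still using Y?" question is a semaphore handshake, along the lines of the sem_wait/sem_post suggestion in the comments above. This sketch assumes a strict one-value-at-a-time handoff between the single writer and single reader; all names are illustrative:

#include <semaphore.h>
#include <stdlib.h>

char *shared;             // plain pointer: all access is ordered by the semaphores
sem_t ready, done;        // both initialized to 0 with sem_init()

void writer_publish(char *newval) {
    char *old = shared;
    shared = newval;      // safe: the reader is blocked in sem_wait(&ready)
    sem_post(&ready);     // hand the new value to the reader
    sem_wait(&done);      // block until the reader is finished with it
    free(old);            // now nobody can still be using the old value
}

void reader_consume(void) {
    sem_wait(&ready);     // wait for a new value
    char *p = shared;
    /* ... use *p ... */
    sem_post(&done);      // tell the writer we are done
}

The price is that the writer and reader run in lockstep; this is not lock-free, but the correctness argument is trivial compared to reasoning about a reclamation scheme built only on atomics.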

mevets
2

A plain global `char *ptr` should not be considered atomic. It might work sometimes, especially with optimization disabled, but you can get the compiler to make safe and efficient optimized asm by using modern language features to tell it you want atomicity.

Use C11 `stdatomic.h` or the GNU C `__atomic` builtins. And see [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/q/36624881) - yes, the underlying asm operations are atomic "for free", but you need to control the compiler's code-gen to get sane behaviour for multithreading.

See also LWN: [Who's afraid of a big bad optimizing compiler?](https://lwn.net/Articles/793253/) - the weird effects of using plain vars include several well-known really bad things, but also more obscure stuff like invented loads: reading a variable more than once if the compiler decides to optimize away a local tmp and load the shared var twice instead of loading it into a register. `asm("" ::: "memory")` compiler barriers may not be sufficient to defeat that, depending on where you put them.
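For instance, the invented-load hazard looks like this (a sketch; `use()` is a placeholder function, not from the article):

void use(char *);            // placeholder for some real work
char *global_ptr;            // plain shared variable: this is the bug

void unsafe(void) {
    char *tmp = global_ptr;  // the compiler may eliminate tmp and re-load
    if (tmp)                 // global_ptr here ...
        use(tmp);            // ... and again here; the second load can see a
}                            // different value (even NULL) if another thread
                             // stores to global_ptr in between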

So use proper atomic stores and loads that tell the compiler what you want: You should generally use atomic loads to read them, too.

#include <stdatomic.h>            // C11 way
_Atomic(char *) c11_shared_var;   // all access to this is atomic; the _explicit functions
                                  // are needed only if you want weaker ordering.
                                  // (Note: _Atomic char *p would declare a pointer to
                                  // atomic char, not an atomic pointer.)

void foo_c11(char *newval) {
   atomic_store_explicit(&c11_shared_var, newval, memory_order_relaxed);
}

char *plain_shared_var;       // GNU C way
// This is a plain C var.  Only specific accesses to it are atomic; be careful!

void foo_gnu(char *newval) {
   __atomic_store_n(&plain_shared_var, newval, __ATOMIC_RELAXED);
}

Using `__atomic_store_n` on a plain var is the functionality that C++20 `atomic_ref` exposes. If multiple threads access a variable for the entire time that it needs to exist, you might as well just use C11 `stdatomic.h`, because every access needs to be atomic (not optimized into a register or whatever). When you want to let the compiler load once and reuse that value, do `char *tmp = c11_shared_var;` (or `atomic_load_explicit` if you only want acquire instead of seq_cst; that's cheaper on a few non-x86 ISAs).
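For instance (a tiny sketch reusing the declarations above; `use()` is a placeholder):

void use(char *);

void reader(void) {
   char *tmp = c11_shared_var;   // one seq_cst load of the _Atomic var
   // or, for a cheaper acquire load:
   // char *tmp = atomic_load_explicit(&c11_shared_var, memory_order_acquire);
   use(tmp);                     // reuse tmp freely; the shared var is not re-loaded
   use(tmp);
}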


Besides lack of tearing (atomicity of the asm load or store), the other key parts of `_Atomic(foo *)` are:

  • The compiler will assume that other threads may have changed memory contents (like volatile effectively implies), otherwise the assumption of no data-race UB will let the compiler hoist loads out of loops. Without this, dead-store elimination might only do one store at the end of a loop, not updating the value multiple times.

    The read side of the problem is usually what bites people in practice; see Multithreading program stuck in optimized mode but runs normally in -O0 - e.g. `while(!flag){}` becomes `if(!flag) infinite_loop;` with optimization enabled.

  • Ordering wrt. other code. e.g. you can use memory_order_release to make sure that other threads that see the pointer update also see all changes to the pointed-to data. (On x86 that's as simple as compile-time ordering, no extra barriers needed for acquire/release, only for seq_cst. Avoid seq_cst if you can; mfence or locked operations are slow.)

  • Guarantee that the store will compile to a single asm instruction. You'd be depending on this. It does happen in practice with sane compilers, although it's conceivable that a compiler might decide to use rep movsb to copy a few contiguous pointers, and that some machine somewhere might have a microcoded implementation that does some stores narrower than 8 bytes.

    (This failure mode is highly unlikely; the Linux kernel relies on volatile load/store compiling to a single instruction with GCC / clang for its hand-rolled intrinsics. But if you just used asm("" ::: "memory") to make sure a store happened on a non-volatile variable, there's a chance.)

Also, something like `ptr++` will compile to an atomic RMW operation like `lock add qword [mem], 4` (for a pointer to a 4-byte type), rather than separate load and store like `volatile` would. (See Can num++ be atomic for 'int num'? for more about atomic RMWs.) Avoid that if you don't need it; it's slower. e.g. `atomic_store_explicit(&ptr, ptr + 1, memory_order_release);` - seq_cst loads are cheap on x86-64, but seq_cst stores aren't.
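As a sketch of that difference (names illustrative; assumes this thread is the only writer, as in the question):

#include <stdatomic.h>

_Atomic(int *) aptr;

void bump_rmw(void) {
   aptr++;        // atomic RMW: lock add qword [aptr], 4 on x86-64; slower
}

void bump_store(void) {
   // separate atomic load + atomic store: not a single atomic RMW, but
   // fine (and cheaper) when only one thread ever writes aptr
   int *tmp = atomic_load_explicit(&aptr, memory_order_relaxed);
   atomic_store_explicit(&aptr, tmp + 1, memory_order_release);
}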

Also note that memory barriers can't create atomicity (lack of tearing), they can only create ordering wrt other ops.

In practice, x86-64 ABIs do have alignof(void*) == 8, so all pointer objects should be naturally aligned (except in a `__attribute__((packed))` struct, which violates the ABI), so you can use `__atomic_store_n` on them. It should compile to what you want (a plain store, no overhead) and meet the asm requirements to be atomic.
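If you want to check those assumptions in code, both C11 and GNU C expose them (a small sketch):

#include <stdatomic.h>
#include <stdio.h>

int main(void) {
    // C11: 2 means pointer atomics are always lock-free
    printf("ATOMIC_POINTER_LOCK_FREE = %d\n", ATOMIC_POINTER_LOCK_FREE);
    // GNU C builtin: nonzero if 8-byte naturally aligned objects are always lock-free
    printf("always lock-free: %d\n", (int)__atomic_always_lock_free(sizeof(void *), 0));
    return 0;
}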

See also When to use volatile with multi threading? - you can roll your own atomics with volatile and asm memory barriers, but don't. The Linux kernel does that, but it's a lot of effort for basically no gain, especially for a user-space program.


Side note: an often repeated misconception is that volatile or _Atomic are needed to avoid reading stale values from cache. This is not the case.

All machines that run C11 threads across multiple cores have coherent caches, so no explicit flush instructions are needed in the reader or writer, just ordinary load and store instructions, like x86 mov. The key is to not let the compiler keep values of shared variables in CPU registers (which are thread-private). It normally can do this optimization because of the assumption of no data-race Undefined Behaviour. Registers are very much not the same thing as L1d CPU cache; managing what's in registers vs. memory is done by the compiler, while hardware keeps cache in sync. See When to use volatile with multi threading? for more details about why coherent caches are sufficient to make volatile work like memory_order_relaxed.

See Multithreading program stuck in optimized mode but runs normally in -O0 for an example.
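To make the register-vs-cache point concrete, here is a sketch of the classic failure and the fix (names illustrative):

#include <stdatomic.h>
#include <stdbool.h>

bool plain_flag;                // BROKEN as an inter-thread flag: the compiler
                                // may hoist the load, turning while(!plain_flag){}
                                // into if(!plain_flag) for(;;){} - the value is
                                // cached in a register, not in L1d

_Atomic bool ready_flag;        // OK: every iteration re-loads from memory

void wait_for_flag(void) {
    while (!atomic_load_explicit(&ready_flag, memory_order_acquire))
        ;                       // compiles to a plain x86 mov load in a loop;
                                // hardware cache coherency delivers the update
}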

Peter Cordes
-1

"Atomic" is treated as this quantum state where something can be both atomic and not atomic at the same time because "it's possible" that "some machines" "somewhere" "might not" write "a certain value" atomically. Maybe.

That is not the case. Atomicity has a very specific meaning, and it solves a very specific problem: threads being pre-empted by the OS to schedule another thread in its place on that core. And you cannot stop a thread from executing mid-assembly instruction.

What that means is that any single assembly instruction is "atomic" by definition. And since you have register move instructions, any register-sized copy is atomic by definition. That means a 32-bit integer on a 32-bit CPU, and a 64-bit integer on a 64-bit CPU, are all atomic -- and of course that includes pointers (ignore all the people who will tell you "some architectures" have pointers of "different size" than registers; that hasn't been the case since the 386).

You should however be careful not to hit variable caching problems (i.e. one thread writing a pointer, and another trying to read it but getting an old value from the cache); use volatile as needed to prevent this.

Blindy
  • *What that means is that any single assembly instruction is "atomic" by definition* The problem is in **proving** that one line of C code will **always** be translated into a single instruction. You can never do that. – Andrew Henle Aug 03 '20 at 16:35
  • Not trying to prove that, only that reading or writing an integer (or a pointer) is atomic. Whatever operations you do on it aren't. – Blindy Aug 03 '20 at 16:37
  • "Any single assembly instruction is atomic by definition": Again, depending on your definition. A read-modify-write instruction like `add mem, reg` is atomic in the sense that it won't be interrupted on its own core, but non-atomic in the sense that another core could write that location in between the "read" and "write" parts of the instruction. I realize that is not the issue here but the sentence could be misleading out of context. – Nate Eldredge Aug 03 '20 at 16:39
  • Yes, you are. You are implicitly assuming loads or stores generated by a compiler to implement your C code will be done in a single instruction. And even by doing that, ordering and visibility guarantees of atomic operations are ignored. – Andrew Henle Aug 03 '20 at 16:40
  • *getting an old value from the cache* - no, that isn't the problem. It's getting an old value *from a register*, assuming that memory hasn't changed. i.e. turning `while(!flag){}` into `if(!flag){ infinite_loop; }` by hoisting a non-atomic / non-volatile load. All real-world C11 thread implementations run on hardware with coherent caches; please stop spreading this misconception that cache can be stale as the explanation for why you need `_Atomic` or `volatile`. – Peter Cordes Aug 03 '20 at 16:57
  • (In theory you could possibly implement C11 on a machine that needed explicit flush instructions even on `relaxed` operations to make a store visible "in a reasonable amount of time", but in practice if you have a heterogeneous ARM with microcontroller + DSP or something that aren't coherent, you don't run threads of the same program across those cores.) – Peter Cordes Aug 03 '20 at 16:58
  • I agree with @AndrewHenle; use `atomic_store_explicit(&var, value, memory_order_relaxed)` to make sure you get a single-instruction store, as recommended in [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/q/36624881). This has the advantage over `volatile` that you can use `mo_release` to make sure that store is ordered after stores to the pointed-to data, and that ISO C11 guarantees the correctness of your program, rather than mostly unwritten rules and assumptions that roll-your-own atomics (like the Linux kernel's) depend on. – Peter Cordes Aug 03 '20 at 17:05
  • Re “And you cannot stop a thread from executing mid-assembly instruction”: Some processors have interruptible instructions. Not likely a compiler would use them for updating a stored pointer, but it shows the incorrectness and hazard of making assumptions like this. ARM has interruptible load/store multiple register instructions, and somebody (I forget whether VAX, IBM, or Intel) has an interruptible copy-bytes instruction. – Eric Postpischil Aug 03 '20 at 17:22
  • @EricPostpischil: x86-64 `rep movsb` (memcpy in a can) is interruptible. Compilers do sometimes inline `rep movsb` for small fixed-size copies, or at least used to, but it's rare these days because SIMD load/store avoids microcode startup overhead. (Unless Ice Lake's "fast short rep" feature changes that tuning heuristic.) But yes it's conceivable that a compiler could decide to basically memcpy a couple pointers, including the one you care about, and that some CPU might execute it internally as narrower than 8-byte stores. (e.g. with the "fast strings" control reg bit cleared) – Peter Cordes Aug 03 '20 at 17:27
  • So many "in theory", "could be" and "some processors" and whatnot. None of us are talking about those processors, none of us are talking about `rep movsb` (since I specifically said register-sized variables), and so on. Armchair systems designers aside, what this guy is trying to do doesn't need explicit atomic support from the x86/x86_64 instruction set, and none of us care about your other processors. – Blindy Aug 03 '20 at 17:30
  • @Blindy Are you saying that if my code does `c = 0; d = 0; c = 1;` it can't be changed by the compiler to `c = 0; d = 0; c++;`? Or `c = 0; d = 0; c = d + 1;`? Where do I find this guarantee? – David Schwartz Aug 03 '20 at 17:31
  • @Blindy: Right, it doesn't need any special x86-64 *asm*, you just need special C syntax to make sure you get the right asm. `volatile` is not the recommended way, and is kind of a dead-end path because as soon as you need any ordering stronger than `relaxed`, you'd need to use inline asm or switch to something else. – Peter Cordes Aug 03 '20 at 17:47
  • The `rep movsb` example was for a compiler compiling assignments to multiple plain `char *` objects that *happen* to be adjacent into one `rep movsb`. e.g. if they're in a struct or array, or the compiler just happens to know that globals are adjacent. GCC in practice *does* merge `=` assignments to adjacent struct members, so you could expect a `movaps` or even unaligned `movups` for setting multiple pointers in a struct. In practice we're sure those are [at least per-element atomic on real CPUs, but there's no documented guarantee](https://stackoverflow.com/q/46012574/224132). – Peter Cordes Aug 03 '20 at 17:49
  • But yes, the `rep movsb` possible problem is mostly hypothetical, not a practical concern I think. Still, the same language features that give other good stuff also prevent that problem. – Peter Cordes Aug 03 '20 at 17:51
  • @DavidSchwartz, real programming isn't about theoretical questions or fear or whatever else drives you, it's easy to check exactly what's going on: https://godbolt.org/z/jaYYMW (all 3 atomic assignments, in case you can't tell). I will try to make this as clear as I can: your theoretical questions hold absolutely no value to me. – Blindy Aug 03 '20 at 17:57
  • @PeterCordes, that's fine, let it optimize its heart away, that's still atomic, documented or not. Everything else is theoretical at best. – Blindy Aug 03 '20 at 17:58
  • Something that happens to work in one case doesn't prove it for the general case. That's part of what makes debugging, testing, and reasoning about lockless code hard. Different surrounding code could create problems. I suggest you read this LWN article: [Who's afraid of a big bad optimizing compiler?](https://lwn.net/Articles/793253/) which points out multiple real-world gotchas that can happen with GCC. – Peter Cordes Aug 03 '20 at 18:03
  • @Blindy I've been a real programmer for several decades now and I've gotten severely burned too many times by assuming that future optimizers wouldn't make particular transformations to my code. What you're suggesting people do -- assume particular future optimizations will not be possible or likely -- is a recipe for disaster. And for what? Using appropriate atomic operations has *no* performance cost and makes the code easier to maintain and understand. – David Schwartz Aug 03 '20 at 18:22
  • @DavidSchwartz: Is there any document that fully and accurately describes the language that the authors of clang and gcc seek to process? From what I can tell, they regard situations where they view the standard as needlessly restricting optimizations as defects in the Standard, and makes no attempt to uphold the Standard as written, but I'm unaware of any document that describes all the cases where they do that. Most situations where I've seen optimizations prove problematic are a result of compiler writers willfully ignoring the Committee intentions as documented in the Rationale. – supercat Aug 03 '20 at 22:06