
Suppose I have several threads accessing the same memory location. If they write at all, they all write the same value, and none of them reads it. After that, all threads converge (through locks) and only then do I read the value. Do I need to use an atomic for this? This is for an x86_64 system. The value is an int32.

Peter Cordes
Dany Bittel
    You tagged this assembly, does that mean you're writing that code in assembly? – harold Aug 25 '20 at 17:47
    You’ve tagged this with both assembly and C and C++. The answer for assembly is definitely different from the other two. In assembly, each write to a dword is atomic if it is 4-byte aligned. (And usually even if it isn’t aligned.) – prl Aug 25 '20 at 17:47
  • The OP wants an answer to all 3 languages. – stackoverblown Aug 25 '20 at 17:54
    As a general rule, if you think it needs to be atomic, it probably does. – doron Aug 25 '20 at 18:02
  • And who reads the memory? Or is it some hardware register? – Martin Rosenau Aug 25 '20 at 18:20
    Do you care whose value appears in the memory location? Last to write or any old value? – Erik Eidt Aug 25 '20 at 18:30
    In C and C++ you need an atomic, as at most one concurrent writer is allowed without synchronization. In addition in C and C++ the compiler isn't required to write *anything* to memory until a sync event (the *as-if* rule). – rustyx Aug 25 '20 at 18:32
  • Does the C language support atomic variables? – Thomas Matthews Aug 25 '20 at 18:41
    @ThomasMatthews It does, as of [C11](https://en.cppreference.com/w/c/atomic). – G.M. Aug 25 '20 at 18:46
  • No, you do not _need_ to use an `atomic`, but if you choose not to, the variable needs to be protected from simultaneous access by [some other protective mechanism](https://queue.acm.org/detail.cfm?id=2088916). – ryyker Aug 25 '20 at 19:09
  • In general the answer is yes but with the specific restriction you describe the answer is no – 4386427 Aug 25 '20 at 19:42
  • You might prefer to instead give every thread its own result variable in an array and reduce over the array. While this might seem slower, you can probably do the reduction in parallel (have 1/n threads reduce n of the elements of the array, recurse log(n) times). – EOF Aug 25 '20 at 19:47
    In theory, if you use a variable of atomic type and the platform supports atomic writes then the compiler will make the right decision of placing fences where needed (might be interesting to try this out) – M.M Aug 25 '20 at 21:12
    Related for the general case without this locking pattern: [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/q/36624881) / [Multithreading program stuck in optimized mode but runs normally in -O0](https://stackoverflow.com/q/58516052) / [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) (never, use mo_relaxed atomics) / – Peter Cordes Aug 25 '20 at 22:38
  • Yes you do because without atomic (or volatile but it is a bad idea) the compiler is free to **completely remove** a write that is never read from. – Zan Lynx Aug 26 '20 at 00:43
  • @ZanLynx How could the compiler completely remove a write to a regular variable (say `int`)? As opposed to *delay* the write. – curiousguy Aug 31 '20 at 15:49
  • @curiousguy: If there is a loop in a thread that writes a variable, but that variable is never read from in the loop, why shouldn't the compiler just remove it entirely? And it will. – Zan Lynx Aug 31 '20 at 17:18
  • @curiousguy I mean, why do you think thread libraries *invented* atomic variables, memory fences, compiler fences, etc? It was not just for fun. – Zan Lynx Aug 31 '20 at 17:20
  • @ZanLynx "_why do you think thread libraries invented atomic variables, memory fences, compiler fences, etc?_" If you only ever have *one* shared variable (one object that diff threads can change), like the cancel flag, how are fences useful? And without atomic vars, you don't even have a formal guarantee that two threads writing at the same time to the same object don't produce an invalid bit pattern (think misaligned `double` or `long double`)... So there are many use cases of atomics. – curiousguy Aug 31 '20 at 18:29
  • @ZanLynx "_If there is a loop_ (...)" That's my Q exactly. What allows it do that? How can one determine that a write is useless? – curiousguy Aug 31 '20 at 19:14

1 Answer


According to §5.1.2.4 ¶25 and ¶4 of the official ISO C standard, two different threads writing to the same memory location with non-atomic operations, where neither write happens before the other, is a data race and causes undefined behavior. The ISO C standard makes no exception to this rule when all threads write the same value.

Although the Intel/AMD specifications for x86/x64 CPUs guarantee that writing a 32-bit integer to a 4-byte aligned address is atomic, the ISO C standard gives no such guarantee unless you use a data type that the standard itself guarantees to be atomic (such as atomic_int_least32_t). Therefore, even if your threads write a value of type int32_t to a 4-byte aligned address, your program still has undefined behavior according to the ISO C standard.

However, for practical purposes, it is probably safe to assume that the compiler is generating assembly instructions that perform the operation atomically, provided that the alignment requirements are met.

Even if the memory writes were not aligned and the CPU wouldn't execute the write instructions atomically, it is likely that your program will still work as intended. It should not matter if a write operation is split up into two write operations, because all threads are writing the exact same value.

If you decide not to use an atomic variable, then you should at least declare the variable as volatile. Otherwise, the compiler is free to keep the variable only in a CPU register, or to delay or sink the store, so that other CPUs may never see any changes to that variable.

So, to answer your question: It is probably not necessary to declare your variable as atomic. However, it is still highly recommended. Generally, all operations on variables that are accessed by several threads should either be atomic or be protected by a mutex. The only exception to this rule is if all threads are performing read-only operations on this variable.

Playing around with undefined behavior can be dangerous and is generally not recommended. In particular, if the compiler detects code that causes undefined behavior, it is allowed to treat that code as unreachable and optimize it away. In certain situations, some compilers actually do that. See this very interesting post by Microsoft Blogger Raymond Chen for more information.

Also, beware that several threads writing to the same location (or even the same cache line) can disrupt the CPU pipeline, because the x86/x64 architecture guarantees strong memory ordering which must be enforced. If the CPU's cache coherency protocol detects a possible memory order violation due to another CPU writing to the same cache line, the whole CPU pipeline may have to be cleared. For this reason, it may be more efficient for all threads to write to different memory locations (in different cache lines, at least 64 bytes apart) and to analyze the written data after all threads have been synchronized.

Andreas Wenzel
    *it is probably safe to assume that the compiler is generating assembly instructions that perform the operation atomically* - Yes, but it's not safe to assume that the stores or loads happen *at all*. They can be sunk or hoisted out of loops, unless you roll your own atomics with `volatile` – Peter Cordes Aug 25 '20 at 22:25
    False-sharing is a problem even on CPUs with weakly-ordered memory models. They still need to maintain coherency, so a store can't commit from the store buffer to L1d cache unless the line is exclusively owned by this core. (In MESI Exclusive or Modified state.) But yes, on x86 false sharing can also have the extra bad effect on readers of memory-order mis-speculation pipeline nukes. That's not a *stall* per-se; without speculative early loads x86 CPUs would have to *actually* do loads in program order and stall on every cache miss load. – Peter Cordes Aug 25 '20 at 22:28
  • This sounds wrong because according to Section 5.1.2.4 p25, one of the conditions for the undefined behavior is "and neither happens before the other." In this case, since all threads write the same value, the order becomes irrelevant so that part of p25 is not satisfied. Also, since none of them read the value until all reach the same synchronization point, the alignment doesn't matter. – Hadi Brais Aug 25 '20 at 22:32
    @HadiBrais: You're assuming that 2 threads writing the same value don't conflict. But the standard does define the term "conflict" early more strictly: p4 of the same section: *Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.* No exception is made for writing the same value; this is technically data-race UB. – Peter Cordes Aug 25 '20 at 22:36
  • Update to my first comment: in practice the locking the OP describes probably makes it safe in practice, but might as well use `std::atomic` with `memory_order_relaxed` because that's free if it's just a single location. (If it's an array, it would defeat auto-vectorization.) – Peter Cordes Aug 25 '20 at 22:37
  • @PeterCordes Having conflicting actions is only one of the conditions. I agree that they are conflicting according to that definition. The second condition is that at least one of them is not atomic, which is also satisfied here. But the last condition isn't. All of the three conditions have to be met in order for us to call it undefined behavior. – Hadi Brais Aug 25 '20 at 22:38
  • @HadiBrais: You hit UB when the unsynchronized writes happen, even if the program exits without ever reading the variable. p25 - *The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.* Your argument that order doesn't matter there is I think still based on writing the same value not conflicting? We know from p4 that's not how the standard is worded. (Yes, this is much stricter than real implementations need.) – Peter Cordes Aug 25 '20 at 22:46
  • @HadiBrais: Or perhaps you misread p4? The condition is write + (read or write), not write + read. – Peter Cordes Aug 25 '20 at 22:48
  • @PeterCordes I fully realize that they are conflicting even if they are writing the same value. I'm not saying otherwise and my argument is not about that. A careful reading of p25 shows that it's not just about conflicting actions. The part that says *and neither happens before the other.* is the basis of my argument. When writing the same value, is this particular part of p25 satisfied or not? Although I'm not sure whether I understand this particular part of p25 correctly. There are many other similar questions on SO, but the accepted answers don't appear to be consistent or conclusive. – Hadi Brais Aug 25 '20 at 22:58
    @HadiBrais: I was trying to guess what your exact argument was, thanks for clarifying. My understanding of that wording is that there's no "happens-before" relationship between the two assignments / stores. i.e. `x = 42;` in two separate threads without any mutual exclusion via locking or a release/acquire synchronization via some other variable. I don't see any way that writing the same value could change whether or not one "happens before" the other. – Peter Cordes Aug 25 '20 at 23:03
  • @HadiBrais: Hypothetically, a compiler could compile `y=41;` `x=y+1;` into asm that stores x=41 and increments it once, which can obviously break if another thread stores 42. Inventing writes is AFAIK allowed as long as one write to the object already exists in the current block (between any possible synchronization points like acq or rel operations, or branches). Or even just using `x` as arbitrary read+write scratch space and only leaving 42 at the end. – Peter Cordes Aug 25 '20 at 23:06
  • @PeterCordes Let's say we have two threads, T1 and T2, each writing once the same value to the same location. I can claim that the write from T1 always happens before the write from T2. Can you prove me wrong? Regarding your last comment, potentially a good point, but that's a different argument. I'm saying that the particular argument presented in this answer doesn't seem correct. – Hadi Brais Aug 25 '20 at 23:12
  • @HadiBrais: I could prove you wrong on a C++ implementation for a machine with hardware race detection that faults on this data-race UB, if it actually happens. That's one justification I've heard for ISO C++ data-race rules being so strict. They can declare things UB even when they're not observable by the C++ abstract machine. (And yes, my hypothetical only follows from my interpretation of what they mean by "happens before" - that kind of thread-unsafe code-gen is only allowed if it has its usual C++ technical meaning.) – Peter Cordes Aug 25 '20 at 23:17
  • @HadiBrais: Or another way to look at it: yes the UB only happens if there is any overlap between the stores, but writing a program with the *possibility* of UB depending on timing is usually a bad idea. We usually simplify that to "is UB". Note that there's no guarantee that writing a value is atomic unless you use an `atomic<>` object, so overlap is possible. Worst case, the writing is spread out over the full span of whatever block it's in, using some non-simple way of doing it like write something and then increment. – Peter Cordes Aug 25 '20 at 23:20
  • @PeterCordes "I could prove you wrong on a C++ implementation for a machine with hardware race detection that faults on this data-race UB" Alright, but I think the OP is talking about running the code on a typical x86 machine with no race detection. Oh yeah, the other issue with this answer I've mentioned earlier is alignment, which I don't think is required for atomicity because no thread is going to read the value before all threads reach the same sync point, even if the object is not `atomic<>`. – Hadi Brais Aug 25 '20 at 23:29
    @HadiBrais: I'm not disputing that it will work in practice on real compilers, with synchronization before the read. Just that it's UB in ISO C++, and therefore a DeathStation 9000 compiler for x86 could do that `mov [x], 41` / `inc [x]` code-gen if it wanted to (with invented reads from x), *because* that only breaks code with data-race UB. And yes, as you say, atomicity isn't required if the unsynchronized part compiles to pure writes. (But you'll have atomicity anyway because real compilers use `alignof(int)=4`.) – Peter Cordes Aug 25 '20 at 23:54
  • @PeterCordes: Thanks for pointing out the need for the keyword `volatile`. I have added a paragraph in my answer to address this issue. – Andreas Wenzel Aug 26 '20 at 12:57
    I think in the OP's specific case, `volatile` isn't necessary for it to happen to work despite UB. The locking the OP describes is probably not possible without effectively being a compiler memory barrier. All stores done at some point during a thread's lifetime will be visible after it exists. [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) - never, use atomic with mo_relaxed. If a plain variable wouldn't be safe, there's no middle ground where `volatile` is a better choice than `atomic` + `mo_relaxed`. (Or C++20 atomic_ref to allow some non-atomic access) – Peter Cordes Aug 26 '20 at 13:49
  • A better way to justify not needing `volatile` or `atomic` on real-world implementations: If only one of the threads had written a non-atomic object, there'd be no UB and `thread.join()` or similar sync would be sufficient to read the value without a data race. So the only change from that to this is multiple writers, and we know in practice that won't hurt on x86-64. (Unless a DeathStation 9000 does malicious code-gen that breaks on that multiple-writer data race UB.) Compile-time optimization doing inter-thread analysis to see the UB is implausible and probably wouldn't break anyway. – Peter Cordes Aug 26 '20 at 22:10
  • But there's very little reason not to use `atomic_int` with mo_relaxed. It needs a bunch of comments to justify omitting it, and warn of danger at the write and read sites. – Peter Cordes Aug 26 '20 at 22:11
    @HadiBrais - you can't simply "claim" that one happens before the other: rather you'd need to show that one write happens-before the other based on one of the various rules described in the standard. That's not the case here: there is no happens-before relationship between the two independent writes. One possible point of confusion is that in the standard _happens before_ has a specific technical meaning, rather than just the lay meaning of "occurs earlier in time". – BeeOnRope Aug 30 '20 at 01:23