68

I wrote a simple multithreaded program as follows:

#include <chrono>   // std::chrono::seconds
#include <future>   // std::async
#include <iostream>
#include <thread>   // std::this_thread

static bool finished = false;

int func()
{
    size_t i = 0;
    while (!finished)
        ++i;
    return i;
}

int main()
{
    auto result=std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished=true;
    std::cout<<"result ="<<result.get();
    std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}

It behaves normally in Debug mode in Visual Studio, or with -O0 in gcc, and prints out the result after 1 second. But it gets stuck and doesn't print anything in Release mode, or with -O1, -O2, or -O3.

curiousguy
sz ppeter
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/201432/discussion-on-question-by-sz-ppeter-multithreading-program-stuck-in-optimized-mo). – Samuel Liew Oct 25 '19 at 21:37

3 Answers

102

Two threads accessing a non-atomic, unguarded variable is U.B. This concerns finished. You could make finished of type std::atomic<bool> to fix this.

My fix:

#include <iostream>
#include <future>
#include <atomic>

static std::atomic<bool> finished = false;

int func()
{
    size_t i = 0;
    while (!finished)
        ++i;
    return i;
}

int main()
{
    auto result=std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished=true;
    std::cout<<"result ="<<result.get();
    std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}

Output:

result =1023045342
main thread id=140147660588864

Live Demo on coliru


Somebody may think 'It's a bool – probably one bit. How can this be non-atomic?' (I did when I started with multi-threading myself.)

But note that lack-of-tearing is not the only thing that std::atomic gives you. It also makes concurrent read+write access from multiple threads well-defined, stopping the compiler from assuming that re-reading the variable will always see the same value.

Making a bool unguarded and non-atomic can cause additional issues:

  • The compiler might decide to optimise the variable into a register, or even CSE multiple accesses into one and hoist the load out of the loop.
  • The variable might be cached for a CPU core. (In real life, CPUs have coherent caches. This is not a real problem, but the C++ standard is loose enough to cover hypothetical C++ implementations on non-coherent shared memory where atomic<bool> with memory_order_relaxed store/load would work, but where volatile wouldn't. Using volatile for this would be UB, even though it works in practice on real C++ implementations.)

To prevent this from happening, the compiler must be told explicitly not to perform such optimisations.
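
As Peter Cordes points out in the comments below, a pure stop flag like this needs no ordering with respect to any other memory, so the accesses can use std::memory_order_relaxed. A minimal sketch of that cheaper variant (my adaptation of the fix above, not part of the original answer):

#include <atomic>
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

static std::atomic<bool> finished{false};

static size_t func()
{
    size_t i = 0;
    // Relaxed load: we only need the value of the flag itself,
    // no ordering with respect to any other variable.
    while (!finished.load(std::memory_order_relaxed))
        ++i;
    return i;
}

int main()
{
    auto result = std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished.store(true, std::memory_order_relaxed); // a relaxed store suffices, too
    std::cout << "result =" << result.get() << '\n';
}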


I'm a little bit surprised by the evolving discussion concerning the potential relation of volatile to this issue. Thus, I'd like to spend my two cents.

Scheff's Cat
  • Thanks, I understand that might be the reason why it works only in Debug mode. Another question I need some clarification on: does a resource need to be protected even when other threads are only reading it, like the global bool I have here? – sz ppeter Oct 23 '19 at 05:56
  • 1
    @szppeter It need not be guarded if _all_ involved threads are _reading only_. In your case, one thread (the main one) writes. The other thread reads but doesn't see the modification. That it works in debug mode is just an unhappy accident. U.B. is U.B. "It works" is one kind of U.B. (the most underhanded). ;-) – Scheff's Cat Oct 23 '19 at 05:58
  • @szppeter For a better learning effect, I would suggest you have a deep look into mutexes, locks and so on... Atomics do quite a few things that are key for multithreading, and you should understand them. – skratchi.at Oct 23 '19 at 06:13
  • 4
    I took one look at `func()` and thought "I could optimise that away". The optimiser doesn't care for threads at all, and will detect the infinite loop, happily turning it into a `while(true)`. If we look at https://godbolt.org/z/Tl44iN we can see this: if `finished` is `true`, it returns; if not, it goes into an unconditional jump back to itself (an infinite loop) at label `.L5`. – Baldrickk Oct 23 '19 at 16:21
  • You probably should never actually do this, but would making the variable `volatile` fix this too, at less cost? – val is still with Monica Oct 23 '19 at 19:13
  • 3
    @val No. [SO: When to use volatile with multi threading?](https://stackoverflow.com/a/4558031/7478597) – Scheff's Cat Oct 23 '19 at 20:23
  • 1
    *The variable might be cached for a CPU core.* Possible on a hypothetical system with non-coherent shared memory, but in real life nobody runs multi-threaded C++ programs across non-coherent shared memory. (You use MPI or whatever to use it for low-overhead message passing with explicit flushes of specific regions). See https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/. It's not actual CPU caches that are the problem, it's that assuming no UB allows the compiler to hoist the load out of the loop and make asm like `if(!finished) { infinite_loop() }` – Peter Cordes Oct 24 '19 at 04:22
  • In other words, it's not atomicity per-se (lack of tearing) that's the problem here, it's that (assumption of no) data-race UB allows the compiler to hoist and CSE loads of globals. So yes, @val, the part of `std::atomic` that we need here is exactly the same as what `volatile` gives us in practice on real implementations, and how code like this was written before C++11, when threads weren't part of the language. In C++11, `volatile` doesn't prevent data-race UB, but on real implementations it's pretty much `atomic` with `memory_order_relaxed`, and operations like `++` aren't atomic-RMW – Peter Cordes Oct 24 '19 at 04:36
  • 2
    @val: there's basically no reason to abuse `volatile` in C++11 because you can get identical asm with `atomic` and `std::memory_order_relaxed`. It does work though on real hardware: caches are coherent so a load instruction can't keep reading a stale value once a store on another core commits to cache there. (MESI) – Peter Cordes Oct 24 '19 at 04:38
  • 5
    @PeterCordes Using `volatile` is still UB though. You really should never assume something that is definitely and clearly UB is safe just because you can't think of a way it could go wrong and it worked when you tried it. That has gotten people burned over and over. – David Schwartz Oct 24 '19 at 05:52
  • 1
    @DavidSchwartz: Yes, of course it is. I thought my comments were pretty clearly saying that you shouldn't actually *use* `volatile` this way, just that this part of the explanation is misleading for people's understanding of how CPUs work. There is a widespread misconception that reading stale data *from cache* is a real problem on mainstream CPUs. It's not, only when the compiler caches a value *in a register* (per-thread private) or CSEs away multiple accesses entirely. Anyway, I added an answer on [When to use volatile with multi threading?](//stackoverflow.com/a/58535118). – Peter Cordes Oct 24 '19 at 06:12
  • @DavidSchwartz: In C, the semantics of volatile have always been implementation-defined, and at the time the Standard was written the simplest (and thus most common) treatment was to regard a volatile load or store as though it might affect any object that wasn't qualified `register`. While C++ is a different language, it inherited many concepts from C without bothering to fully re-specify all of them. – supercat Oct 24 '19 at 17:28
  • While this answer is good insofar as it makes perfectly correct claims (data race / UB, 4.7.1/20.1) and provides a correct, working solution, I believe that the **real** reason why it works in debug and doesn't work in optimized build is a different one. The reason is that both setting `finished` to whatever value within `main` and checking for it in `func` is entirely inconsequential on a per-thread scope, so the compiler simply doesn't do it at all. – Damon Oct 24 '19 at 19:53
  • @Damon Why U.B. does work under certain conditions is interesting and exciting but IMHO less relevant than how are things done right. There are a lot of cases where U.B. is simply unnoticed by causing no visual effect on your common platform (sometimes, until the one day before release when it suddenly starts to pain you). ;-) I was under impression that the [other answer](https://stackoverflow.com/a/58527396/7478597) covered that part, didn't it? (The 2nd link, I just added at the bottom, might provide deeper insights concerning this as well.) – Scheff's Cat Oct 25 '19 at 06:00
  • That's right, what I want to point out (and maybe failed to do) is that UB sounds like the well-known "demons fly out of your nose" thing. Whereas in this case (and many cases) UB is rather a harmless "unspecified, don't care". Although UB, it _would_ actually "work" (...and fail sporadically, but not systematically). It does however not matter here whether the `bool` is being updated atomically or whether the happens-before guarantees are strictly met or not. Inside `func` there is no way the value _could_ change, and inside `main` there is no consequence of setting it. If a mutex was used... – Damon Oct 25 '19 at 09:08
  • ... (cont) then the UB would be gone (without using an atomic variable), but it would still not work. Because the UB _per se_ isn't the reason it doesn't work, it's because the optimizer is (correctly) stripping out stuff that is entirely irrelevant. – Damon Oct 25 '19 at 09:09
  • @Damon There is the distinction between Undefined Behavior vs. Implementation-Defined Behavior. That U.B. results in implementation defined behavior on a specific real system doesn't mean that this might be true in general as well as this might not change in the future. IMHO, this can be reduced to "Is this a data race?" (I think so, [Race Conditions versus Data Races](https://www.modernescpp.com/index.php/race-condition-versus-data-race)) and "Is a data race U.B?" (I found [Wait, data races are undefined behavior?](https://news.ycombinator.com/item?id=16250729)) – Scheff's Cat Oct 25 '19 at 09:15
  • 3
    @Damon Mutexes have release/acquire semantics. The compiler is not allowed to optimize the read away if a mutex was locked before, so protecting `finished` with a `std::mutex` works (without `volatile` or `atomic`). In fact, you can replace all atomics with a "simple" value + mutex scheme; it would still work and just be slower. `atomic` is allowed to use an internal mutex; only `atomic_flag` is guaranteed lock-free. – Erlkoenig Oct 25 '19 at 11:17
  • [Here](https://godbolt.org/z/M8Dow3) is an example of using a mutex instead of `atomic` that I think is correct and does work with GCC and Clang and `-O2` (but is probably slower); a similar sketch follows these comments. – Erlkoenig Oct 25 '19 at 11:36
  • @Erlkoenig: I'm pretty sure that isn't true, though it might accidentally "work", of course. That's the funny thing with UB, it can actually "work" despite being broken. Acquire/release semantics are a barrier which do not allow a read or write operation to be **moved** across the barrier. Still, they do not require an operation to happen at all, if -- within the thread's scope -- it's a useless operation, such as reading a value which is "obviously" constant. Same for writing a value which is "obviously" never used. – Damon Oct 25 '19 at 17:39
  • @Damon I know about UB, but using (only) mutexes with shared data isn't it. Any access to data shared between threads might be "obviously" useless and only become relevant in cooperation with other threads. If this were UB, then you'd have to use atomics all the time, and all the classic patterns of protecting shared data with mutexes wouldn't work. That can't be right; atomics "just" make things faster, but they are not _required_. Replace the `bool finished` with a `std::string finished` that is either "yes" or "no" - would you suggest you'd have to use `std::atomic`? – Erlkoenig Oct 25 '19 at 18:32
  • @Damon What is an obvious constant value? – curiousguy Nov 10 '19 at 10:58
  • @curiousguy: A value for which no codepath exists that modifies it (such as in the question) is "obviously" constant. There is no way `finished` could possibly ever have any value other than `false`. At least, not in any constellation where `func` is actually being called (and that's all that matters). The compiler couldn't possibly divine that some external magic alters the value, without being explicitly hinted to it (and it probably _shouldn't_ even care, since enforcing global consistency would be a very serious pessimization for 99.999% of all programs). – Damon Nov 11 '19 at 11:17
  • @Damon No path where? In 1 function? 1 TU? the program? – curiousguy Nov 11 '19 at 11:37
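
Following up on Erlkoenig's comments above, here is a minimal sketch of the mutex-based variant (my own reconstruction, not the linked godbolt example). Locking the mutex around every access to finished removes the data race and forces a fresh read on each iteration; it works without atomic or volatile, but is likely slower than the atomic fix:

#include <chrono>
#include <future>
#include <iostream>
#include <mutex>
#include <thread>

static bool finished = false;   // plain bool, only ever touched under the mutex
static std::mutex finished_mutex;

static size_t func()
{
    size_t i = 0;
    for (;;) {
        {
            // Hold the lock just long enough to sample the flag.
            std::lock_guard<std::mutex> lock(finished_mutex);
            if (finished)
                break;
        }
        ++i;
    }
    return i;
}

int main()
{
    auto result = std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    {
        std::lock_guard<std::mutex> lock(finished_mutex);
        finished = true;
    }
    std::cout << "result =" << result.get() << '\n';
}
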
43

Scheff's answer describes how to fix your code. I thought I would add a little information on what is actually happening in this case.

I compiled your code at godbolt using optimisation level 1 (-O1). Your function compiles like so:

func():
  cmp BYTE PTR finished[rip], 0
  jne .L4
.L5:
  jmp .L5
.L4:
  mov eax, 0
  ret

So, what is happening here? First, we have a comparison: cmp BYTE PTR finished[rip], 0 - this checks to see if finished is false or not.

If it is not false (aka true), we should exit the loop on the first run. This is accomplished by jne .L4, which jumps (when not equal) to label .L4, where the value of i (0) is stored in a register for later use and the function returns.

If it is false however, we move to

.L5:
  jmp .L5

This is an unconditional jump to label .L5, which just so happens to be the jump command itself.

In other words, the thread is put into an infinite busy loop.

So why has this happened?

As far as the optimiser is concerned, threads are outside of its purview. It assumes other threads aren't reading or writing variables simultaneously (because that would be data-race UB). You need to tell it that it cannot optimise accesses away. This is where Scheff's answer comes in. I won't bother to repeat him.

Because the optimiser is not told that the finished variable may potentially change during execution of the function, it sees that finished is not modified by the function itself and assumes that it is constant.

The optimised code provides the two code paths that will result from entering the function with a constant bool value; either it runs the loop infinitely, or the loop is never run.
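
Expressed back in C++, the optimised function is roughly equivalent to the following (a sketch of the transformation, not code the compiler literally produces):

// What -O1 effectively made of func(), given the assumption that
// 'finished' cannot change while the function runs:
int func()
{
    if (!finished)      // the single load: cmp BYTE PTR finished[rip], 0
        for (;;) {}     // the infinite busy loop (.L5: jmp .L5)
    return 0;           // i is provably still 0 on this path
}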

At -O0, the compiler (as expected) does not optimise the loop body and comparison away:

func():
  push rbp
  mov rbp, rsp
  mov QWORD PTR [rbp-8], 0
.L148:
  movzx eax, BYTE PTR finished[rip]
  test al, al
  jne .L147
  add QWORD PTR [rbp-8], 1
  jmp .L148
.L147:
  mov rax, QWORD PTR [rbp-8]
  pop rbp
  ret

Therefore the function, when unoptimised, does work. The lack of atomicity here is typically not a problem, because the code and data type are simple. Probably the worst we could run into here is a value of i that is off by one from what it should be.

A more complex system with data structures is far more likely to result in corrupted data or improper execution.
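
To make that concrete, here is a small sketch of my own (not from the answer; assumes C++14 for the digit separators): two threads increment a shared counter. The plain long loses updates because ++ is a non-atomic read-modify-write, while the std::atomic counter does not:

#include <atomic>
#include <iostream>
#include <thread>

int main()
{
    long plain = 0;                      // racy: concurrent ++ is UB
    std::atomic<long> atomic_ctr{0};     // atomic read-modify-write

    auto work = [&] {
        for (int i = 0; i < 1'000'000; ++i) {
            ++plain;                                        // data race: increments get lost
            atomic_ctr.fetch_add(1, std::memory_order_relaxed);
        }
    };

    std::thread t1(work), t2(work);
    t1.join();
    t2.join();

    // atomic_ctr is reliably 2000000; 'plain' typically comes out lower
    // (and, being the product of UB, is not guaranteed to be anything).
    std::cout << "plain =" << plain << "  atomic =" << atomic_ctr.load() << '\n';
}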

Peter Cordes
Baldrickk
  • 3
    C++11 does make threads and a thread-aware memory model part of the language itself. This means compilers can't invent writes even to non-`atomic` variables in code that doesn't write those variables. e.g. `if (cond) foo=1;` can't be transformed to asm that's like `foo = cond ? 1 : foo;` because that load+store (not an atomic RMW) could step on a write from another thread. Compilers were already avoiding stuff like that because they wanted to be useful for writing multi-threaded programs, but C++11 made it official that compilers had to not break code where 2 threads write `a[1]` and `a[2]` – Peter Cordes Oct 24 '19 at 04:27
  • 2
    But yes, other than that overstatement about how compilers aren't aware of threads *at all*, your answer is correct. Data-race UB is what allows hoisting loads of non-atomic variables including globals, and the other aggressive optimizations we want for single-threaded code. [MCU programming - C++ O2 optimization breaks while loop](//electronics.stackexchange.com/a/387478) on electronics.SE is my version of this explanation. – Peter Cordes Oct 24 '19 at 04:31
  • @PeterCordes: Who's "we"? There are many forms of hoisting optimizations that are useful, whose effects may be observable in weird corner cases, but there is also value in allowing certain forms of benign data race. Consider, for example, the behavior of string hashing in Java. If a string has never been hashed and Thread1 tries to hash it, it will compute the hash value and store it in the string object. If Thread2 then asks for the hash, it may or may not see the value; if it doesn't, it will compute the hash and store it. The reads and writes in Thread2 have data races... – supercat Oct 24 '19 at 17:36
  • ...but the cost of possible redundant hash computation will almost always be less than the complexity needed to avoid it. A behavioral model that allows certain transformations in the absence of constructs that would block them could maximize the efficiency and safety with which various useful tasks could be accomplished, compared with requiring programmers to jump through hoops to accommodate a needlessly-inconvenient execution model. – supercat Oct 24 '19 at 17:42
  • @supercat: That's a good example. Standard library implementers on a platform where it's safe in practice could still do that with acquire/release semantics for an atomic already-hashed flag. But yes, the hash data itself could still be written the same way twice, which is UB. Similar problem as a SeqLock: no UB-safe way to express a pattern where we can detect possible tearing and ignore it. Even if the hash is small enough to be lock-free as a relaxed atomic, that's still optimization-defeating. And so is an atomic flag, maybe; maybe fake it with memory barriers? – Peter Cordes Oct 24 '19 at 23:05
  • @PeterCordes: Is there any zero-overhead way of creating an object in C++ with relaxed atomic semantics, whose sole effect would be to require that compilers not *add* reads or writes? From what I understand, on some platforms, constructing and destructing objects of an atomic integer type may require memory barriers, even if the object itself is never used for anything other than relaxed semantics. – supercat Oct 25 '19 at 14:52
  • 1
    @PeterCordes: One advantage of Java using a GC is that memory for objects won't be recycled without an intervening *global* memory barrier between the old and new usage, which means that any core that examines an object will always see some value that it has held at some time after the reference was first published. While global memory barriers can be very expensive if they're used frequently, they can greatly reduce the need for memory barriers elsewhere even when used sparingly. – supercat Oct 25 '19 at 15:38
  • @PeterCordes I didn't say the compiler doesn't know about threads, just that the optimiser doesn't account for possible thread interactions unless we specifically tell it. – Baldrickk Oct 25 '19 at 20:27
  • 1
    Yes, I knew that's what you were trying to say, but I don't think your wording 100% means that. Saying the optimizer "completely ignores them." isn't quite right: it's well known that truly ignoring threading when optimizing can involve things like word load / modify a byte in the word / word store, which in practice has caused bugs where one thread's access to a char or bitfield steps on a write to an adjacent struct member. See https://lwn.net/Articles/478657/ for the full story, and how only the C11 / C++11 memory model makes such an optimization illegal, not just undesired in practice. – Peter Cordes Oct 25 '19 at 20:50
  • "effectively" was a good change, but I think there's room to be more explicit about how the C++ rules work in that sentence without confusing or distracting readers. Feel free to roll back if you liked your wording better, I offer my edit merely as a suggestion. – Peter Cordes Oct 25 '19 at 20:56
  • 1
    No, that's good. Thanks @PeterCordes, I appreciate the improvement. – Baldrickk Oct 26 '19 at 05:04
5

For the sake of completeness on the learning curve: you should avoid using global variables. You did a good job, though, by making it static, so it will be local to the translation unit.

Here is an example:

#include <atomic>
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

class ST {
public:
    int func()
    {
        size_t i = 0;
        while (!finished)
            ++i;
        return i;
    }
    void setFinished(bool val)
    {
        finished = val;
    }
private:
    std::atomic<bool> finished = false;
};

int main()
{
    ST st;
    auto result=std::async(std::launch::async, &ST::func, std::ref(st));
    std::this_thread::sleep_for(std::chrono::seconds(1));
    st.setFinished(true);
    std::cout<<"result ="<<result.get();
    std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}

Live on wandbox

Oblivion
  • 1
    Could also declare `finished` as `static` within the function block. It will still be initialized only once, and if it’s initialized to a constant, this does not require locking. – Davislor Oct 23 '19 at 22:00
  • The accesses to `finished` could also use cheaper `std::memory_order_relaxed` loads and stores; there's no required ordering wrt. other variables in either thread. I'm not sure @Davislor's suggestion of `static` makes sense, though; if you had multiple spin-count threads you wouldn't necessarily want to stop them all with the same flag. You do want to write the initialization of `finished` in a way that compiles to just initialization, not an atomic store, though. (Like you're doing with the `finished = false;` default initializer C++17 syntax. https://godbolt.org/z/EjoKgq). – Peter Cordes Jan 19 '20 at 01:34
  • @PeterCordes Putting the flag in an object does allow there to be more than one, for different thread pools, as you say. The original design had a single flag for all threads, however. – Davislor Jan 19 '20 at 20:31
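
Picking up the point from these comments that separate workers may want separate stop flags, here is a small usage sketch of my own. It assumes the ST class (and headers) from the answer above; each instance carries its own atomic flag, so the workers can be stopped independently:

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main()
{
    ST a, b;  // two workers, two independent 'finished' flags

    auto ra = std::async(std::launch::async, &ST::func, std::ref(a));
    auto rb = std::async(std::launch::async, &ST::func, std::ref(b));

    std::this_thread::sleep_for(std::chrono::seconds(1));
    a.setFinished(true);   // stops only the first worker
    std::this_thread::sleep_for(std::chrono::seconds(1));
    b.setFinished(true);   // the second keeps counting a second longer

    std::cout << "a =" << ra.get() << "\nb =" << rb.get() << std::endl;
}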