Why is notify required inside a critical section?

Question

I'm reading this book here (official link, it's free) to understand threads and parallel programming.

Here's the question.

Why does the book say that pthread_cond_signal must be done with a lock held to prevent data race? I wasn't sure, so I referred to this question (and this question too), which basically said "no, it's not required". Why would a race condition occur?
What and where is the race condition being described?

The code and passage in question is as follows.

...
The code to wake a thread, which would run in some other thread, looks like this:
pthread_mutex_lock(&lock);
ready = 1;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);
A few things to note about this code sequence. First, when signaling (as well as when modifying the global variable ready), we always make sure to have the lock held. This ensures that we don’t accidentally introduce a race condition into our code. ...

(Please refer to the free, official PDF for context.)

I couldn't comment with a small question in the link-2, so here is a full question.

Edit 1: I understand the lock is to control access to the ready variable. I am wondering why there's a race condition associated with the signaling. Specifically,

> First, when signaling [...] we always make sure to have the lock held. This ensures that we don’t accidentally introduce a race condition into our code

Edit 2: I've seen resources and comments (from the links commented below and during my own research), sometimes within the same page, that say either that it doesn't matter or that you must signal while holding the lock for "predictable behavior". It would be nice if this could be touched upon too: can the unpredictable behavior be anything other than spurious wakeups? Which advice should I follow?

Edit 3: I'm looking for more of a 'theoretical' answer, not implementation specific so that I can understand the core idea. I understand answers to these can be platform specific, but an answer that focuses on the core ideas of lock, mutex, condition variable as all implementations must follow these semantics, perhaps adding their own little quirks. Example, wait() can wake up spuriously, and given bad timing of signaling, can happen on 'pure' implementations too. Mentioning these would help.

My apologies for so many edits, but my dearth of in-depth knowledge in this field is confusing the heck outta me.

Any insight would be really helpful, thanks. Also, please feel free to point me to books where I can read these concepts in detail, and where I can learn C++ with these concepts too. Thanks.

In the paragraph following the code, including in the part you pasted, it says that the race condition is about the `ready` variable. It has nothing to do with `pthread_cond_signal`. Have you completely read the text around this code snippet? To me at least, it clearly explains exactly why `pthread_cond_signal` is before the `pthread_mutex_unlock`. — Thomas Jager, Feb 08 '20 at 16:53
I see in the text you link: *"To use a condition variable, one has to in addition have a lock that is associated with this condition. When calling either of the above routines, this lock should be held"* - is this the reason that made you ask the question? If so, you should add this to the question before the first quotation, as it better explains why you are asking it. — Marco Bonelli, Feb 08 '20 at 17:01
No, I understand that a condition variable must be used in conjunction with a lock. The reasoning is "What threads must be awakened? Those associated with this lock." — B_Dex_Float, Feb 08 '20 at 17:07
And also this one: https://stackoverflow.com/questions/6419117/signal-and-unlock-order — dragosht, Feb 08 '20 at 17:22
I am not sure. Some sources suggest that I must 'signal-then-unlock' while others say 'it doesn't matter'. Take [this answer](https://stackoverflow.com/a/4544494/8560442) for example, it quotes that I must, for 'predictable behavior' (in quotes because I'm not sure what is unpredictable there, apart from spurious wakeups, which don't seem to make sense to me in that answer's context). A [comment](https://stackoverflow.com/questions/4544234/calling-pthread-cond-signal-without-locking-mutex#comment9223783_4567919) says that it 'doesn't matter' — B_Dex_Float, Feb 08 '20 at 17:36
The book referenced is speaking specifically about pthreads. Tag added. — John Bollinger, Feb 09 '20 at 18:18
@JohnBollinger True, but the ideas presented are general. It only used pthreads to show code that can be readily executed. The question posted is also about locking/waiting/notifying in general, irrespective of the implementation. Should I add an edit to make that explicit? I assumed it was understood. — B_Dex_Float, Feb 09 '20 at 18:29
@B_Dex_Float, mutexes and condition variables are general concepts, but the comment about which you are inquiring pertains to a specific implementation (pthreads), and implementations are not necessarily all the same in that regard. The tag I already added is the minimum information appropriate for making the context clear, but if you want to clarify in the text as well then that would be good, too. — John Bollinger, Feb 09 '20 at 19:50

John Bollinger · Answer 1 · 2020-02-09T20:01:23.357

> Why does the book say that pthread_cond_signal must be done with a lock held to prevent data race? I wasn't sure, so I referred to this question (and this question too), which basically said "no, it's not required". Why would a race condition occur?

The book not presenting a complete example, my best guess as to the intended meaning is that there can be a data race with the CV itself if it is signaled without the associated mutex being held. That may be the case for some CV implementations, but the book is talking specifically about pthreads, and pthreads CVs are not subject to such a limitation. Neither is C++ std::condition_variable, which is what the two other SO questions you referred to are talking about. So in that sense, the book is just wrong.

It is true that one can construct examples of poor CV use in which signaling under protection of the associated mutex largely protects against data races, whereas signaling without such protection is susceptible to them. But in such a case, the fault is not with the signaling itself but with the waiting, and if that's what the book means then it is deceptively worded. And probably still wrong.

> What and where is the race condition being described?

One can only guess what the author had in mind.

For the record, the proper usage of condition variables involves firstly determining what condition one wants to ensure holds before execution proceeds. That condition will necessarily involve shared variables, else there is no reason to expect that anything another thread does could change whether the condition is satisfied. That being the case, all access to the shared variables involved needs to be protected by a mutex if more than one thread is alive.

That mutex should then, secondly, also be the one associated with the CV, and threads must wait on the CV only while the mutex is held. This is a requirement of every CV implementation I know, and it protects against signals being missed and possible deadlock resulting from that. Consider this faulty, and somewhat contrived, example:

// BAD
int temp;

result = pthread_mutex_lock(m);
// handle failure results ...

temp = shared;

result = pthread_mutex_unlock(m);
// handle failure results ...

if (temp == 0) {
    result = pthread_cond_wait(cv, m);
    // handle failure results ...
}

// do something ...

Suppose that it was allowed to wait on the CV without holding the mutex, as that code does. That code supposes that at some point in the future, some other thread (T2) will update shared (under protection of the mutex) and then signal the CV to tell the waiting one (T1) that it can proceed. But what if T2 does that between when T1 unlocks the mutex and when it begins its wait? It doesn't matter whether T2 signals the CV under protection of the mutex or not -- T1 will begin a wait for a signal that has already been delivered. And CV signals do not queue.
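The point that CV signals do not queue can be observed directly. The following is a minimal sketch, not code from the book: the helper name `signal_is_lost`, the variable names, and the 100 ms timeout are assumptions made for illustration. It signals a condition variable that has no waiters and then begins a timed wait, which is expected to time out because the earlier signal was not remembered.

```c
#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t lost_m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t lost_cv = PTHREAD_COND_INITIALIZER;

/* Hypothetical helper: returns 1 if a wait that begins after the
 * signal times out, 0 if the wait returned for any other reason. */
int signal_is_lost(void)
{
    struct timespec deadline;
    int rc;

    /* No thread is waiting, so this signal has no effect. */
    pthread_cond_signal(&lost_cv);

    /* Absolute deadline 100 ms from now. TIME_UTC is assumed here to
     * match the realtime clock that pthread_cond_timedwait uses by
     * default. */
    timespec_get(&deadline, TIME_UTC);
    deadline.tv_nsec += 100L * 1000L * 1000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&lost_m);
    rc = pthread_cond_timedwait(&lost_cv, &lost_m, &deadline);
    pthread_mutex_unlock(&lost_m);

    return rc == ETIMEDOUT;
}
```

A spurious wakeup could in principle make the wait return early, in which case the helper reports 0; that possibility is one more reason a waiter has to test a shared predicate rather than rely on the signal alone.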

So suppose that T1 only waits under protection of the mutex, as is in fact required. That's not enough. Consider this:

// ALSO BAD

result = pthread_mutex_lock(m);
// handle failure results ...

if (shared == 0) {
    result = pthread_cond_wait(cv, m);
    // handle failure results ...
}

result = pthread_mutex_unlock(m);
// handle failure results ...

// do something ...

This is still wrong, because it does not reliably prevent T1 from proceeding past the wait when the condition of interest is unsatisfied. Such a scenario can arise from

- the signal being legitimately sent and received even though the particular condition of interest to T1 is not satisfied;
- the signal being legitimately sent and received, and the condition being satisfied when the signal is sent, but T2 or another thread modifying the shared variable again before T1 returns from its wait;
- a spurious return from the wait, which is very rare but does occasionally happen in many real-world implementations.

None of that depends on T2 sending the signal without mutex protection.

The correct way to wait on a condition variable is to check the condition of interest before waiting, and afterward to loop back and check again before proceeding:

// OK

result = pthread_mutex_lock(m);
// handle failure results ...

while (shared == 0) {  // <-- 'while', not 'if'
    result = pthread_cond_wait(cv, m);
    // handle failure results ...
}
// shared != 0 here; typically one resets it (shared = 0) at this point

result = pthread_mutex_unlock(m);
// handle failure results ...

// do something ...

It may sometimes be the case that thread T1 executing that code will return from its wait when the condition is not satisfied, but if ever it does then it will simply return to waiting instead of proceeding when it shouldn't. If other threads signal only under protection of the mutex then that should be rare, but still possible. If other threads signal without mutex protection then T1 may wake more often than strictly needed, but there is no data race involved, and no inherent risk of misbehavior.
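To see the whole pattern run end to end, here is a minimal, self-contained sketch under stated assumptions: the names `waiter`, `signaler`, and `run_demo` are made up for this illustration, and error handling is omitted for brevity. The signaling thread deliberately calls `pthread_cond_signal` after releasing the mutex, and the waiter still behaves correctly because it re-checks the predicate in a loop.

```c
#include <pthread.h>

static pthread_mutex_t demo_m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t demo_cv = PTHREAD_COND_INITIALIZER;
static int demo_shared = 0;   /* condition of interest: demo_shared != 0 */

/* T1: waits until demo_shared is nonzero, records it, then resets it. */
static void *waiter(void *arg)
{
    int *seen = arg;

    pthread_mutex_lock(&demo_m);
    while (demo_shared == 0) {            /* 'while', not 'if' */
        pthread_cond_wait(&demo_cv, &demo_m);
    }
    *seen = demo_shared;                  /* predicate holds here */
    demo_shared = 0;
    pthread_mutex_unlock(&demo_m);
    return NULL;
}

/* T2: updates the shared variable under the mutex, then signals
 * without holding it. */
static void *signaler(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&demo_m);
    demo_shared = 1;
    pthread_mutex_unlock(&demo_m);
    pthread_cond_signal(&demo_cv);        /* mutex not held here */
    return NULL;
}

/* Hypothetical driver: returns the value T1 observed. */
int run_demo(void)
{
    pthread_t t1, t2;
    int seen = 0;

    pthread_create(&t1, NULL, waiter, &seen);
    pthread_create(&t2, NULL, signaler, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return seen;
}
```

Whichever thread runs first, T1 ends up observing the update: if T2 finishes before T1 locks the mutex, T1 sees `demo_shared != 0` and never waits; otherwise T1 is woken and re-checks the predicate before proceeding.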

Minor clarification, in the second bullet point; it implies that there is a small duration of time from the thread waking to `wait` acquiring the lock where the thread can be preempted, correct? Also, my understanding of the 'data race' the author mentions is the race you describe here. — B_Dex_Float, Feb 09 '20 at 20:09
@B_Dex_Float, any running thread can be preempted at any time. Holding a mutex does not prevent that. It prevents only other threads acquiring the same mutex. — John Bollinger, Feb 09 '20 at 20:11
Yes yes, I mean that that situation causes some other thread to hold the lock. This causes the waking thread to see a different change from what the `notify` was meant for. — B_Dex_Float, Feb 09 '20 at 20:14
@B_Dex_Float, after a thread wakes, before it returns from `pthread_cond_wait()`, it contends for the mutex just like any other thread. In particular, if the signal was sent by a thread holding the mutex then very possibly that other thread will still hold it when the waiting one first wakes. — John Bollinger, Feb 09 '20 at 20:18
Yes, I do realize that. Was just making sure my reasoning for the notification being weird from the perspective of the waking thread was correct, so that I can reason better how `wait` works. — B_Dex_Float, Feb 09 '20 at 20:23
I'm just trying to answer the question you posed, @B_Dex_Float. To put it more directly, (1) a different thread may hold the mutex when T1 wakes; (2) either way, any number of other threads may acquire the mutex and subsequently release it after T1 wakes but before T1 acquires it, and therefore before T1 returns from waiting. — John Bollinger, Feb 09 '20 at 20:46
Yes, I was trying to wrap my head around all of the consequences of this situation. Thank you for the exchange! — B_Dex_Float, Feb 10 '20 at 09:39

Andrey Semashev · Answer 2 · 2020-02-09T18:48:09.090

> Why does the book say that pthread_cond_signal must be done with a lock held to prevent data race? I wasn't sure, so I referred to this question (and this question too), which basically said "no, it's not required". Why would a race condition occur?

Yes, condition variable notification should generally be performed with the corresponding mutex locked. The reason is not so much to avoid a race condition but to avoid a missed or superfluous notification.

Consider the following piece of code:

std::queue< int > events;

std::mutex mutex;
std::condition_variable cond;

// Thread 1
void consume_events()
{
    std::unique_lock< std::mutex > lock(mutex); // #1
    while (true)
    {
        if (events.empty())                     // #2
        {
            cond.wait(lock);                    // #3
            continue;
        }

        // Process an event
        events.pop();
    }
}

// Thread 2
void produce_event(int event)
{
    {
        std::unique_lock< std::mutex > lock(mutex); // #4
        events.push(event);                         // #5
    }                                               // #6

    cond.notify_one();                              // #7
}

This is a classical example of one producer/one consumer queue of data.

In line #1 the consumer (Thread 1) locks the mutex. Then, in line #2, it tests whether there are any events in the queue and, if there are none, in line #3 it unlocks the mutex and blocks. When the condition variable is notified, the thread unblocks, reacquires the mutex, and continues execution past line #3 (that is, it goes back to line #2).

In line #4 the producer (Thread 2) locks the mutex and in line #5 it enqueues a new event. Because the mutex is locked, modifying the event queue is safe (line #5 cannot execute concurrently with line #2), so there is no data race. Then, in line #6, the mutex is unlocked and in line #7 the condition variable is notified.

It is possible that the following happens:

1. Thread 2 acquires the mutex in line #4.
2. Thread 1 attempts to acquire the mutex in line #1 or #3 (upon being unblocked by a previous notification). Since the mutex is locked by Thread 2, Thread 1 blocks.
3. Thread 2 enqueues the event in line #5 and unlocks the mutex in line #6.
4. Thread 1 unblocks and acquires the mutex. In line #2 it sees that the event queue is not empty and processes the event. On the next loop iteration the queue is empty and the thread blocks in line #3.
5. Thread 2 notifies Thread 1 in line #7. But there are no queued events, and Thread 1 wakes up in vain.

Though in this particular example, the extra wake up is benign, depending on the loop contents, it may be detrimental. The correct code should call notify_one before unlocking the mutex.

Another example is when one thread is used to initiate some work in the other thread without an explicit queue of events:

std::mutex mutex;
std::condition_variable cond;

// Thread 1
void process_work()
{
    std::unique_lock< std::mutex > lock(mutex); // #1
    while (true)
    {
        cond.wait(lock);                        // #2

        // Do some processing                   // #3
    }
}

// Thread 2
void initiate_work_processing()
{
    cond.notify_one();                          // #4
}

In this case Thread 1 waits until it is time to perform some activity (e.g. render a frame in a video game). Thread 2 periodically initiates that activity by notifying Thread 1 via condition variable.

The problem is that the condition variable does not buffer notifications and acts only on the threads that are actually blocked on it at the point of notification. If there are no threads blocked then the notification does nothing. This means that the following sequence of events is possible:

1. Thread 1 acquires the mutex in line #1 and blocks in line #2.
2. Thread 2 decides it is time to perform the periodic activity and notifies Thread 1 in line #4.
3. Thread 1 unblocks and goes to perform the activities (e.g. render a frame).
4. It turns out that this frame is a lot of work, and when Thread 2 comes to notify Thread 1 about the next frame in line #4, Thread 1 is still busy with the previous one. This notification gets missed.
5. Thread 1 is finally done with the frame and blocks in line #2. The user observes a dropped frame.

The above wouldn't have happened if Thread 2 locked the mutex before notifying Thread 1 in line #4. Thread 1 holds the mutex the whole time it is rendering, so if it is still busy with a frame, Thread 2 would block until Thread 1 is done and returns to the wait in line #2, and only then issue the notification.

However, the correct solution for the above task is to introduce a flag or some other data protected by the mutex that Thread 2 can use to signal Thread 1 that it is time to perform its activities. Aside from fixing the missed notification problem, this also takes care of spurious wakeups.
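The flag-based approach might be sketched as follows. This is an illustration under assumptions rather than a definitive implementation: it is written against the pthreads API used by the book instead of the C++ of the examples above, it uses a counter (`pending`) instead of a single flag so that requests arriving while the worker is busy are not lost, and the names `work_done`, `request_shutdown`, and `run_frames` are hypothetical additions needed to make the example terminate.

```c
#include <pthread.h>

static pthread_mutex_t work_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_cond = PTHREAD_COND_INITIALIZER;
static int pending = 0;     /* requests not yet processed */
static int work_done = 0;   /* hypothetical shutdown flag */

/* Thread 1: processes one unit of work per recorded request. */
static void *process_work(void *arg)
{
    int *processed = arg;

    pthread_mutex_lock(&work_mutex);
    for (;;) {
        while (pending == 0 && !work_done)
            pthread_cond_wait(&work_cond, &work_mutex);
        if (pending == 0)
            break;                      /* shut down, nothing left */
        pending--;
        pthread_mutex_unlock(&work_mutex);
        (*processed)++;                 /* stand-in for rendering a frame */
        pthread_mutex_lock(&work_mutex);
    }
    pthread_mutex_unlock(&work_mutex);
    return NULL;
}

/* Thread 2: records the request in shared state, then notifies. */
void initiate_work_processing(void)
{
    pthread_mutex_lock(&work_mutex);
    pending++;
    pthread_mutex_unlock(&work_mutex);
    pthread_cond_signal(&work_cond);
}

static void request_shutdown(void)
{
    pthread_mutex_lock(&work_mutex);
    work_done = 1;
    pthread_mutex_unlock(&work_mutex);
    pthread_cond_signal(&work_cond);
}

/* Hypothetical driver: issues n requests and returns how many ran. */
int run_frames(int n)
{
    pthread_t worker;
    int processed = 0;

    pending = 0;
    work_done = 0;
    pthread_create(&worker, NULL, process_work, &processed);
    for (int i = 0; i < n; i++)
        initiate_work_processing();
    request_shutdown();
    pthread_join(worker, NULL);
    return processed;
}
```

Because each request is recorded in `pending` under the mutex, a notification that arrives while the worker is busy changes nothing about correctness: the worker finds the outstanding request the next time it checks the predicate. Unlike the example above, this sketch releases the mutex while processing, which is a design choice made here and not something the answer prescribes.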

> What and where is the race condition being described?

Definition of a data race depends on the memory model used in the particular environment. This means primarily your programming language memory model and may include the underlying hardware memory model (if the programming language relies on the hardware memory model, which is the case with e.g. Assembler).

C++ defines data races as follows:

> When an evaluation of an expression writes to a memory location and another evaluation reads or modifies the same memory location, the expressions are said to conflict. A program that has two conflicting evaluations has a data race unless
>
> - both evaluations execute on the same thread or in the same signal handler, or
> - both conflicting evaluations are atomic operations (see std::atomic), or
> - one of the conflicting evaluations happens-before another (see std::memory_order)
>
> If a data race occurs, the behavior of the program is undefined.

So basically, when multiple threads access the same memory location concurrently (by means other than std::atomic) and at least one of the threads is modifying the data at that location, that is a data race.
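To make the definition concrete, here is a small sketch in C with pthreads (C11 defines data races in closely similar terms). The names `counter`, `add_many`, and `count_with_two_threads` are assumptions for illustration. Two threads write the same memory location, so their accesses conflict; the mutex orders each increment before the next one, which is what keeps the conflict from being a data race.

```c
#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;    /* memory location written by both threads */

/* Each thread increments the shared counter n times. */
static void *add_many(void *arg)
{
    long n = *(long *)arg;

    for (long i = 0; i < n; i++) {
        /* Unlocking in one thread happens-before the next lock in the
         * other, so the conflicting writes are ordered. */
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Hypothetical driver: returns the final counter value. */
long count_with_two_threads(long n)
{
    pthread_t a, b;

    counter = 0;
    pthread_create(&a, NULL, add_many, &n);
    pthread_create(&b, NULL, add_many, &n);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;         /* safe: both threads have been joined */
}
```

If the lock and unlock calls were removed, the two `counter++` evaluations would be unordered, non-atomic, conflicting accesses, which is a data race and therefore undefined behavior.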

In the explanation for the first question, under the first set of points, can `wait()` be woken up without a `notify`? I'm seeing mixed comments about spurious wakeups, so I'm not sure if I have to _always_ keep that in mind while coding. — B_Dex_Float, Feb 09 '20 at 17:23
About the data race (apart from the `ready` variable), what is the author referring to in the book? After reading your excellent answer, I understood it as the race that can cause dropped notifications. Is this what the author means to point out? If I may, could you kindly suggest resources where I can learn these better? Clearly I'm having trouble reasoning about these cases. Books where I can learn these to the core and reason about them to a point where a future version of me can answer this question? I could _really_ use some resources. Thanks. — B_Dex_Float, Feb 09 '20 at 17:30
About the comment in rendering frames, very interesting! I'm actually writing a small engine to learn C++ (by doing this project). Where would a dropped notification be potentially useful, if at all? I understand this is probably irrelevant to the question+answer, but I was curious nonetheless. — B_Dex_Float, Feb 09 '20 at 17:40
> In the explanation for the first question, under the first set of points, can wait() be woken up without a notify? -- Yes, spurious wakeups are always a possibility. — Andrey Semashev, Feb 09 '20 at 17:55
Justifying the claim based on poor, non-idiomatic use of the condition variable is not very strong, and *demonstrating* poor use of condition variables is counterproductive. Additionally, presenting C++ code in a question about C is not really appropriate, though I grant that in this case, the concepts aren't very language-specific. — John Bollinger, Feb 09 '20 at 18:14
> About the data race (apart from the ready variable), what is the author referring to in the book? -- I did not read the book in full, but I think he means what I described in the answer. POSIX allows `pthread_cond_signal`/`pthread_cond_broadcast` to be called without the mutex locked (https://pubs.opengroup.org/onlinepubs/007908799/xsh/pthread_cond_signal.html). Though some custom implementations of condition variables may have a requirement for external synchronization via a mutex. — Andrey Semashev, Feb 09 '20 at 18:17
> Where would a dropped notification be potentially useful, if at all? -- I don't think missed notifications are useful. Some kind of load throttling can be useful sometimes, but only if controlled and in the places you expect and want it to happen. This is definitely not what missed notifications are. — Andrey Semashev, Feb 09 '20 at 18:21
@AndreySemashev, what you present in both your examples is non-idiomatic use of the condition variable, and *that* is what gives rise to the data race potential within. Consider that if there were more than one consumer thread then the same kind of races you describe could happen even with the CV being signaled only while the mutex is held. — John Bollinger, Feb 09 '20 at 18:22
> Justifying the claim based on poor, non-idiomatic use of the condition variable is not very strong, and demonstrating poor use of condition variables is counterproductive. -- The first example is a classic consumer/producer queue. If you mean the second example, then this is the naive way to implement inter-thread signalling, something that first comes to mind when you're not an experienced programmer. If you can suggest a better example and explanation, please post an answer. — Andrey Semashev, Feb 09 '20 at 18:24
In both examples, I mean continuing past return from the wait without verifying that the condition is satisfied. — John Bollinger, Feb 09 '20 at 18:59
The first example does verify it. The second doesn't and I point out that it should. — Andrey Semashev, Feb 09 '20 at 19:12
The first example checks the condition before waiting, but not before proceeding *past* the wait. The second does neither. With proper use of a pthreads or C++ CV, the only risk from signaling without mutex protection is unneeded wakeups, but these pose only an efficiency issue, not a correctness issue. I see no reason to believe that the book uses the term "data race" to describe that situation. It's much more plausible that the book is just wrong. — John Bollinger, Feb 09 '20 at 19:58
But he makes valid points and uses examples that have problems and explains what they are. Could you explain what you see as incorrect in his examples or explanation? They seem well laid out and explained. — B_Dex_Float, Feb 09 '20 at 20:22
> The first example checks the condition before waiting, but not before proceeding past the wait. The second does neither. -- Please, re-read the answer. The first example continues to the next loop iteration, which checks the condition again. The second example discussion ends with "the correct solution for the above task is to introduce a flag or some other data protected by the mutex that Thread 2 can use to signal Thread 1 that it is time to perform its activities." — Andrey Semashev, Feb 09 '20 at 21:37
