
I am looking at section 4.11 of The Open Group Base Specifications Issue 7 (IEEE Std 1003.1, 2013 Edition), which spells out the memory synchronization rules. This is the most specific statement of the POSIX/C memory model I have managed to find in the standard.

Here's a quote:

4.11 Memory Synchronization

Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads:

fork() pthread_barrier_wait() pthread_cond_broadcast() pthread_cond_signal() pthread_cond_timedwait() pthread_cond_wait() pthread_create() pthread_join() pthread_mutex_lock() pthread_mutex_timedlock()

pthread_mutex_trylock() pthread_mutex_unlock() pthread_spin_lock() pthread_spin_trylock() pthread_spin_unlock() pthread_rwlock_rdlock() pthread_rwlock_timedrdlock() pthread_rwlock_timedwrlock() pthread_rwlock_tryrdlock() pthread_rwlock_trywrlock()

pthread_rwlock_unlock() pthread_rwlock_wrlock() sem_post() sem_timedwait() sem_trywait() sem_wait() semctl() semop() wait() waitpid()

(exceptions to the requirement omitted).

Basically, paraphrasing the above, the rule is that when an application reads or modifies a memory location that another thread or process may be modifying, it must synchronize thread execution and memory with respect to the other threads by calling one of the listed functions. Among them, pthread_create(3) is listed as synchronizing memory.

I understand this to mean that each of these functions implies some sort of memory barrier (although the standard does not seem to use that concept). So, for example, when pthread_create() returns, we are guaranteed that the memory modifications made by the calling thread before the call become visible to other threads (possibly running on a different CPU/core) once they also synchronize memory. But what about the newly created thread: is there an implied memory barrier before the thread starts running the thread function, so that it reliably sees the memory modifications synchronized by pthread_create()? Is this specified by the standard? Or should we provide memory synchronization explicitly in order to trust, per POSIX, the correctness of any data we read?

Special case (which would also answer the above question): does a context switch provide memory synchronization? That is, when the execution of a process or thread is started or resumed, is its memory synchronized with respect to any memory synchronization performed by other threads of execution?

Example:

Thread #1 creates a constant object allocated from the heap. Thread #1 then creates a new thread #2 that reads data from that object. If we can assume the new thread #2 starts with memory synchronized, then everything is fine. However, if the CPU core running the new thread still has a stale copy of that memory (previously allocated and since discarded data) in its cache instead of the new values, then it may have a wrong view of the state and the application may function incorrectly.

More concretely...

  1. Previously in the program (this old value ends up in CPU #0's cache via step 2):

     int i = 0;        
    
  2. Thread T0 running on CPU #0:

     pthread_mutex_lock(...);
     int tmp = i;
     pthread_mutex_unlock(...);
    
  3. Thread T1 running on CPU #1:

     i = 42;
     pthread_create(...);
    
  4. Newly created thread T2 running on CPU #0:

     printf("i=%d\n", i);    /* First step in the thread function */
    

Without a memory barrier, i.e., without synchronizing memory for thread T2, it could happen that the output is

     i=0

(the previously cached, unsynchronized value).
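
To make the scenario concrete, here is a minimal self-contained sketch of the pattern in question (the thread_func name and the final pthread_join() are mine, just for illustration; compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    int i = 0;                          /* shared variable, as in step 1 */

    static void *thread_func(void *arg)         /* thread T2 */
    {
        (void)arg;
        /* The question: is this read guaranteed to see i == 42 without
           any explicit synchronization at the start of the new thread? */
        printf("i=%d\n", i);
        return NULL;
    }

    int main(void)                              /* plays the role of thread T1 */
    {
        pthread_t t2;

        i = 42;                                 /* modification before the call */
        pthread_create(&t2, NULL, thread_func, NULL);   /* listed in 4.11 */
        pthread_join(t2, NULL);
        return 0;
    }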

Update: A lot of applications using the POSIX threads library would not be thread-safe if this kind of implementation craziness were allowed.

FooF
  • I have retracted the downvote, since you apologized, and mistakes do happen, no big problem :) But my answer, as well as the answer from David Schwartz, makes no sense now – so I've deleted mine. – Sigi Mar 12 '14 at 11:08
  • @nos - By my reading, `pthread_create` only synchronizes the calling thread. Though it would be insanity to have to assume the newly-created thread needs memory synchronization before it can reliably access any data. – FooF Mar 12 '14 at 11:16
  • A two-word difference makes two different answers.... butterfly effect in action :) – Sigi Mar 12 '14 at 11:26

3 Answers


is there an implied memory barrier before the thread starts running the thread function, so that it reliably sees the memory modifications synchronized by pthread_create()?

Yes. Otherwise there would be no point to pthread_create acting as memory synchronization (barrier).

(This is, AFAIK, not explicitly stated by POSIX, nor does POSIX define a formal memory model, so you'll have to decide whether you trust your implementation to do the only sane thing it possibly could: ensure synchronization before the new thread is run. I would not worry particularly about it.)
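
If you did not want to rely on that, you could always make the synchronization explicit with one of the listed functions. A rough sketch using a mutex (the names here are made up, and error checking is omitted):

    #include <pthread.h>
    #include <stdio.h>

    pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
    int data;                              /* stands in for your shared state */

    void *worker(void *arg)                /* the newly created thread */
    {
        int local;
        (void)arg;
        pthread_mutex_lock(&init_lock);    /* explicit memory synchronization */
        local = data;
        pthread_mutex_unlock(&init_lock);
        printf("data=%d\n", local);
        return NULL;
    }

    void start_worker(pthread_t *t)        /* the creating thread */
    {
        pthread_mutex_lock(&init_lock);
        data = 42;                         /* written before the new thread exists */
        pthread_mutex_unlock(&init_lock);
        pthread_create(t, NULL, worker, NULL);
    }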

Special case (which would also answer the above question): does a context switch provide memory synchronization? That is, when the execution of a process or thread is started or resumed, is its memory synchronized with respect to any memory synchronization performed by other threads of execution?

No, a context switch does not act as a barrier.

Thread #1 creates a constant object allocated from the heap. Thread #1 then creates a new thread #2 that reads data from that object. If we can assume the new thread #2 starts with memory synchronized, then everything is fine. However, if the CPU core running the new thread still has a stale copy of that memory (previously allocated and since discarded data) in its cache instead of the new values, then it may have a wrong view of the state and the application may function incorrectly.

Since pthread_create must perform memory synchronization, this cannot happen. Any stale data residing in a CPU cache on another core must be invalidated. (Luckily, the commonly used platforms are cache coherent, so the hardware takes care of that.)

Now, if you change your object after you've created your 2nd thread, you need memory synchronization again so that all parties can see the changes and race conditions are otherwise avoided. pthread mutexes are commonly used to achieve that.
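
For instance, something along these lines (just a sketch; the names are made up):

    #include <pthread.h>

    pthread_mutex_t obj_lock = PTHREAD_MUTEX_INITIALIZER;
    int obj_state;                         /* stands in for "your object" */

    /* any thread that modifies the object after the 2nd thread exists */
    void update_object(int v)
    {
        pthread_mutex_lock(&obj_lock);     /* both calls are in the 4.11 list */
        obj_state = v;
        pthread_mutex_unlock(&obj_lock);
    }

    /* any thread that reads the object */
    int read_object(void)
    {
        int v;
        pthread_mutex_lock(&obj_lock);
        v = obj_state;
        pthread_mutex_unlock(&obj_lock);
        return v;
    }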

nos
  • "Otherwise there would be no point..." Exactly! When reading standards, need to also interpret using common sense. – FooF Mar 12 '14 at 11:57
  • As `pthread_join` is also in the list of POSIX library functions providing memory synchronization, it is also implied by common sense that when a thread returns it will also provide memory synchronization so that `pthread_join` gets returned data in good healthy up-to-date shape. – FooF Mar 13 '14 at 02:02

Cache-coherent architectures guarantee, from the architectural design point of view, that even separate CPUs (ccNUMA - cache-coherent Non-Uniform Memory Architecture) with independent memory channels will not, when accessing a memory location, run into the incoherency you are describing in the example.

This comes with a significant performance penalty, but the application will function correctly.

Thread #1 runs on CPU0 and holds the object's memory in its L1 cache. When thread #2 on CPU1 reads the same memory address (or, more exactly, the same cache line - look up false sharing for more info), it incurs a cache miss and the coherency protocol fetches the up-to-date cache line from CPU0 before loading it.
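
As an aside, a tiny illustration of the false-sharing point (my own example; the 64-byte cache-line size is just an assumption):

    #include <stdalign.h>   /* C11 alignas */

    /* Both counters likely share one cache line, so two CPUs updating
       them independently keep bouncing that line between their caches: */
    struct counters_bad {
        long a;
        long b;
    };

    /* Aligning each counter to its own (assumed 64-byte) cache line
       avoids the false sharing: */
    struct counters_good {
        alignas(64) long a;
        alignas(64) long b;
    };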

Sigi
  • While this is interesting in itself, how common are ccNUMA systems actually? Wikipedia article about [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) says about ccNUMA: "As of 2011, ccNUMA systems are multiprocessor systems based on the AMD Opteron processor, which can be implemented without external logic, and the Intel Itanium processor, which requires the chipset to support NUMA. [...] Earlier ccNUMA systems such as those from Silicon Graphics were based on MIPS processors and the DEC Alpha 21364 (EV7) processor." – FooF Mar 13 '14 at 07:17
  • If ccNUMA systems were not the exception, we would not have so much talk about memory barriers and such on the internet, I believe. For example, there would be no need for kernel programmers to digest this: https://www.kernel.org/doc/Documentation/memory-barriers.txt – FooF Mar 13 '14 at 07:19
  • ccNUMA is not the exception at all... it has been the architecture of Intel multiprocessor systems as well since Westmere for x86 CPUs (from 2008 more or less): as soon as AMD showed this superior, more scalable architecture compared to Intel's FSB-based SMPs, Intel jumped on it. I think all of today's Intel multiprocessor systems are ccNUMA (surely you will find some exception to this statement :) ). I mentioned it because it is the extension of the intra-processor/inter-core cache coherency model that is in use in current microprocessors' caches. – Sigi Mar 13 '14 at 08:16
  • Take your link (+1) and look for `CACHE COHERENCY` (all uppercase). You will see what the remaining problem with cache-coherent systems is (operation ordering). On the other hand, that problem concerns in-kernel spinlocks/memory barriers, where you don't have access to the higher-level functionality exposed by libc (the OS can't make use of the OS). The pthreads interfaces have mechanisms in place to guarantee a correct implementation with respect to the specification, transparently making use of such lower-level infrastructure. – Sigi Mar 13 '14 at 08:31
  • See this one: http://www.intel.com/content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf explaining QPI (Intel's version of AMD's HyperTransport :) ). – Sigi Mar 13 '14 at 08:39
  • Erratum in my previous comment... Westmere -> Nehalem - that's the name of the first QPI-based Intel x86 architecture. – Sigi Mar 13 '14 at 08:50
  • The discussion is not about the non-uniformity (NUMA) but about the cache coherency (CC). The same applies to SMPs. With MIPS I guess you do have cache coherency... some time ago I used SGI Origin systems: MIPS ccNUMA (actually SGI invented the term ccNUMA - or at least that was the first time I heard it, circa 2000... the NUMA links were actually the ex-Cray interconnects). – Sigi Mar 13 '14 at 09:44
  • Thanks, this is good stuff to know to write correct multithreaded programs. I will need to take time to study this stuff. What I have gathered is that in the ccNUMA systems and cache coherent systems in general(?), the CPU memory barriers are only needed to make sure the memory writes and reads are *ordered correctly* and not affected by the CPU memory access reordering (parallelism within a CPU instruction and memory access pipeline to maximize instruction throughput)? – FooF Mar 14 '14 at 02:40
  • Memory barriers are not needed for "standard" multithreaded programs, but I agree that it is very useful to know the internals, especially when dealing with problems. For developing software there are useful higher-level models - pthreads. As for memory barriers, the answer is yes. This is quite a good summary: http://en.wikipedia.org/wiki/Memory_barrier – Sigi Mar 14 '14 at 20:36

You've turned the guarantee pthread_create provides into an incoherent one. The only thing the pthread_create function could possibly do is establish a "happens before" relationship between the thread that calls it and the newly-created thread.

There is no way it could establish such a relationship with existing threads. Consider two threads, one calls pthread_create, the other accesses a shared variable. What guarantee could you possibly have? "If the thread called pthread_create first, then the other thread is guaranteed to see the latest value of the variable". But that "If" renders the guarantee meaningless and useless.

Creating thread:

i = 1;
pthread_create (...)

Created thread:

if (i == 1)
   ...

Now, this is a coherent guarantee -- the created thread must see i as 1 since that "happened before" the thread was created. Our code made it possible for the standard to enforce a logical "happens before" relationship, and the standard did so to assure us that our code works as we expect.

Now, let's try to do that with an unrelated thread:

Creating thread:

i = 1;
pthread_create (...)

Unrelated thread:

if ( i == 1)
    ...

What guarantee could we possibly have, even if the standard wanted to provide one? With no synchronization between the threads, we haven't tried to establish a logical happens-before relationship. So the standard can't honor it -- there's nothing to honor. There is no particular behavior that is "right", so there is no way the standard can promise us the right behavior.

The same applies to the other functions. For example, the guarantee for pthread_mutex_lock means that a thread that acquires a mutex sees all changes made by, or seen by, any threads that have unlocked the mutex. We logically expect our thread to get the mutex "after" any threads that got the mutex "before", and the standard promises to honor that expectation so our code works.
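
Sketching that guarantee as code (m and shared are assumed to be visible to both threads):

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    int shared;

Thread A:

    pthread_mutex_lock(&m);
    shared = 7;                  /* written while holding the mutex */
    pthread_mutex_unlock(&m);    /* the unlock publishes A's writes */

Thread B, acquiring the mutex after A has released it:

    pthread_mutex_lock(&m);      /* the lock makes A's writes (and whatever A saw) visible */
    int copy = shared;           /* guaranteed to read 7 */
    pthread_mutex_unlock(&m);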

David Schwartz
  • My originally posted question was not clear enough. I have clarified it. It is only relevant to think of two threads (the one calling `pthread_create()` and the newly-created thread), though for an old memory value to be cached on the CPU core of the newly-started thread, there needs to be an unrelated thread that accessed the memory location earlier. – FooF Mar 12 '14 at 11:42
  • To my mind the standard does not specify very coherently the memory synchronization semantics for the newly-created thread, hence the question; though thinking more about it, it would be madness to have an implementation that starts the new thread without synchronizing memory with the calling thread. – FooF Mar 12 '14 at 11:43
  • Sorry for unclear first version of the question, and thanks for the contribution. – FooF Mar 12 '14 at 11:46