
As I understand it, concurrent access to a variable needs some kind of synchronization (mutex, atomic, memory barrier, ...), or else a read in one thread may never see the updated value, no matter how many times it tries.

However, my colleague says the MESI protocol (ignoring CPUs that have no MESI or anything similar) automatically synchronizes the CPU caches: if a thread reads a variable that another thread updated, with no synchronization at all on either the read or the write side (just a plain read, for example "if (a != 0)"), then after some period the read will finally see the updated value if it keeps trying. I think there is no such guarantee.

So I wrote a code to test this:

#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

volatile int * volatile a = 0; // volatile: keep the compiler from caching 'a' in a register or reordering/eliding its accesses
void set() {
    a = new int(1);
    std::cout << "set complete" << std::endl;
}
void read(int i) {
    while(1) {
        if(a != 0) {
            std::cout << i << " detected" << std::endl;
            break;
        }
    }
}
int main()
{
    std::thread td00(std::bind(read, 0));
    std::thread td01(std::bind(read, 1));
    std::thread td02(std::bind(read, 2));
    std::thread td03(std::bind(read, 3));
    std::thread td04(std::bind(read, 4));
    // wait a moment to make sure 'set' gets called after 'read' runs
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    std::thread td1(set);
    td1.join();
    td00.detach();
    td01.detach();
    td02.detach();
    td03.detach();
    td04.detach();
    std::this_thread::sleep_for(std::chrono::minutes(60));
    return 0;
}

However, the run is affected by many factors: sometimes it blocks, sometimes it prints "detected". It is not strong proof either way.

I have searched for this, but the docs I found were unclear. It seems MESI can indeed do this "auto sync" (the programmer does not need to do anything): 'PrRd' and 'PrWr' appear to be just normal read/write requests, without LOCK or CMPXCHG or anything like that. However, for speed, CPUs introduce a store buffer, which lets the CPU reorder memory operations and undermines the effect of that "auto sync". To fix the reordering, the programmer has to use tools (memory barriers) to control it. That means the programmer has to synchronize manually to make things right.
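For reference, here is what "doing the sync manually" could look like, using std::atomic instead of volatile. This is just my own rewrite of the test above for illustration; release/acquire is one reasonable ordering choice, not the only one:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<int*> a{nullptr};

void set() {
    // release store: the int is fully constructed before any thread that
    // acquire-loads the pointer can observe it as non-null
    a.store(new int(1), std::memory_order_release);
    std::cout << "set complete" << std::endl;
}

void read(int i) {
    int* p = nullptr;
    // acquire load: pairs with the release store in set()
    while ((p = a.load(std::memory_order_acquire)) == nullptr) { }
    std::cout << i << " detected, value = " << *p << std::endl;
}

int main() {
    std::thread r0(read, 0), r1(read, 1);
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    std::thread w(set);
    w.join();
    r0.join();
    r1.join();
}

Even here, as far as I can tell, the C++ standard only says atomic stores *should* become visible to other threads in a finite/reasonable amount of time; it gives no hard bound on the delay.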

Do I understand this correctly? If so, and assuming the programmer does not do it manually, is there any guaranteed bound on the delay before a read sees the updated value? I think a read may never see the updated value, but I cannot find evidence for that.

jean
  • The cache is coherent on x86. Programs don't need to push anything, the CPU will try to empty the store buffer asap. The premise is wrong. Memory barrier controls the ordering of stores/loads to avoid the *current* CPU storing/loading a value before another value (e.g. a lock) is globally visible. Visibility is handled automatically (except maybe for WC areas when the CPU is halted and the interrupts masked. Maybe, not sure if the CPU will still flush the WC buffers after a while). – Margaret Bloom May 11 '21 at 09:26
  • Your program always terminates for me. There's nothing in the assembly that would prevent that. Also, MESI doesn't push, it snoops and uses RFOs. – Margaret Bloom May 11 '21 at 09:26
  • Btw the C++ aspect to the question is different than machine level aspect. – harold May 11 '21 at 10:58
  • @nicomp The C++ code is just to validate how the CPU works. It could be any other language – jean May 11 '21 at 11:01
  • [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) explains that caches are coherent, that's why volatile can work as a hand-rolled `memory_order_relaxed` atomic. – Peter Cordes May 11 '21 at 11:14
  • @Peter Cordes So the conclusion is: a write is globally visible without any synchronization under x86_64? – jean May 11 '21 at 14:09
  • Yes, same as on every other mainstream CPU architecture that Linux can run on. (Unless there are any where `WRITE_ONCE` isn't just a cast to `volatile T*`) – Peter Cordes May 11 '21 at 14:11
  • Is this a C++ question or a cpu-architecture question? If it's just cpu-architecture, the title question is a duplicate of [If I don't use fences, how long could it take a core to see another core's writes?](https://stackoverflow.com/q/51292687). The only reason there's more to say is your test case with no ordering between the two `cout <<` prints. – Peter Cordes May 13 '21 at 13:58

1 Answer


The conclusion is: x86_64 is cache coherent, so a normal plain write eventually becomes globally visible to all other cores or CPUs that share one coherence domain (bus).

However, this is of little use for writing normal application code (as opposed to low-level things like compilers, OS kernels, ...). The language memory model completely hides the cache coherence protocol from the coder. Code should not rely on or exploit those protocol features, because the compiler, the language virtual machine, or the runtime can reorder and optimize your code. Even if you know exactly what will happen, writing code that disobeys the language memory model is still subtle and error-prone.

One of the possibilities is that the example code in the question prints "x detected" even before the set function is called (a reference to show how that can happen), or that, without the volatile keywords, variable a is kept in a register, which makes MESI powerless. On top of that, most languages have nothing comparable to C/C++'s volatile keyword, which lets the coder tell just the compiler not to "change" the original code.
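To illustrate the "kept in a register" problem, here is a minimal sketch (my own example, not from the question; the names wait_for_done/finish are made up) of a flag loop that a typical optimizing compiler is allowed to break, regardless of what MESI does:

// 'done' is a plain bool: no volatile, no atomic. Concurrent unsynchronized
// access to it is a data race, i.e. undefined behavior in C++.
bool done = false;

void wait_for_done() {
    // The compiler may load 'done' once, keep the value in a register and
    // turn this into an infinite loop, so no load of 'done' ever reaches
    // the CPU again. The loop can then spin forever even after another
    // thread has set done = true -- coherence cannot help if there is no load.
    while (!done) { }
}

void finish() {
    done = true;
}

Declaring the flag as std::atomic<bool> instead (even with memory_order_relaxed) forces the compiler to emit a real load on every iteration, and then cache coherence does make the store visible to the reader quickly.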

jean
  • Cache coherency is not useless, and neither is `memory_order_relaxed`. Your code can *print* out of order, but that's because `std::cout << "set complete"` is separate from the actual volatile store. `x detected` definitely can't be printed before `set()` is even *called*. No other threads can read the result of a volatile store until some time after it happens. (Branch prediction can let code execute *speculatively* based on an assumption about what they've read, and confirm it later, but visible side-effects like printing have to wait. And x86 has strong memory-ordering rules.) – Peter Cordes May 12 '21 at 13:45
  • Of course, you're right that memory reordering can mix up notions of what "before" and "after" mean. That's why you should use `std::atomic` with the memory-order you want, not roll your own with `volatile`. But note that your example could still print `x detected` before `set complete` with `memory_order_seq_cst`, because nothing makes the reader wait for printing in the writer before running its own print, only the `a` value. In fact, seq_cst probably makes that more likely by slowing down the writer with a full barrier as part of the store before it gets to `cout <<`. – Peter Cordes May 12 '21 at 13:48
  • See [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) for more about cache coherence and volatile, and how that's like `memory_order_relaxed` – Peter Cordes May 12 '21 at 13:49
  • Sorry for being obscure, I updated my answer – jean May 13 '21 at 06:24
  • *Language memory model completely hide those cache coherent protocol from coder.* - not really: the ISO C++ standard has rules about the visibility of all atomic accesses, even `memory_order_relaxed` - https://eel.is/c++draft/intro.multithread#intro.races-note-19 - *[Note 19: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note]* – Peter Cordes May 13 '21 at 13:23
  • I think you're hoping for too much from MESI, or misunderstanding the fact that there is no order in the first place, the way you've written things. There's no compile-time reordering. (`a = ...` then `cout <<` in the writer, and the load of `a` then `cout <<` in each reader.) Even `a` with `memory_order_seq_cst` would allow the "detect" to be printed first. If you want "set complete" to be printed first, print it *before* actually doing the set. (And make the store a release-store, e.g. by only running on x86.) – Peter Cordes May 13 '21 at 13:27
  • Thanks for the details. By "completely hide" I mean the coder does not need to know that MESI exists; C++'s memory_order_xxx is still an abstraction over those protocols. The printed strings are just there to prove that variable 'a' changed; they can indeed be out of order or interleaved. What I meant by "before" is that even when variable 'a' has not changed, the reader thread may detect it as changed, a false positive – jean May 14 '21 at 01:40
  • *a false positive* - No, that doesn't happen. The reader will never detect that `a` has changed until it actually has. x86's memory model is "multicopy atomic" - the only way a store can become visible to a load on another core is by that store becoming visible to *all* other cores, i.e. globally visible. (ISO C++ doesn't guarantee that except for seq_cst; ISO C++'s memory model allows IRIW reordering, which can happen [on PowerPC, unlike ARMv8 and x86](https://stackoverflow.com/questions/67397460/does-stlrb-provide-sequential-consistency-on-arm64/67397890#67397890)) – Peter Cordes May 14 '21 at 02:06
  • Maybe you still haven't understood that what your program is doing is an unsynchronized print sometime after the store, and another unsynchronized print sometime after the reader sees the store. You seem to be claiming that the order of seeing those prints tells you something about the order the store and load happened, but they don't. – Peter Cordes May 14 '21 at 02:08
  • To prove reordering shenanigans, you'd have to put the "set complete" print first before the set, and the "detected" print after the read, and then at runtime see them in the other order, to prove that one or the other had passed the store/load to `a`, instead of just some interleaving of program order within each thread. – Peter Cordes May 14 '21 at 02:10
  • For example, [Is 11 a valid output under ISO c++ for x86\_64,arm or other arch?](https://stackoverflow.com/a/66971509) is an example of considering possible interleavings of program order. Related: [Loads and stores reordering on ARM](https://stackoverflow.com/a/59089757) discusses the fact that a total order of all stores *and loads* to a single atomic object is guaranteed (by ISO C++) to exist, more or less thanks to MESI or other cache-coherency mechanism. (You're correct that you don't *have* to understand how/what hardware implements the formal C++ memory model, though.) – Peter Cordes May 14 '21 at 02:15
  • My bad, the example code is too simple to describe my intention. The argument with my colleague is about a pointer variable in golang that one goroutine updates and other goroutines read. This variable has no synchronization like atomic.Value; I told him the reader goroutines may never see the update, but he told me MESI takes care of that, so it is OK. Before I asked this question, I thought CPU cache coherence relied on the programmer: with no synchronization, the CPU would not do it. Now I know a CPU with MESI can do it automatically. However, that is not enough. – jean May 14 '21 at 02:31
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/232375/discussion-between-jean-and-peter-cordes). – jean May 14 '21 at 02:31
  • Sounds like a Go version of [Multithreading program stuck in optimized mode but runs normally in -O0](https://stackoverflow.com/q/58516052) then. You need *something* (like `volatile` or `atomic` with at least mo_relaxed, or a compiler barrier https://lwn.net/Articles/793253/) to make sure the load actually happens in asm, rather than loading once ahead of a loop and assuming the value stays constant. Freedom to keep a value in a register is why C++ declares that unsynced read/write of non-atomic variables are UB. I don't know the Go memory model or if it has anything like relaxed atomic. – Peter Cordes May 14 '21 at 02:36