
The Intel Architectures Software Developer's Manual, Aug. 2012, vol. 3A, sect. 8.2.2:

Any two stores are seen in a consistent order by processors other than those performing the stores.

But can this be so?

The reason I ask is this: Consider a dual-core Intel i7 processor with HyperThreading. According to the Manual's vol. 1, Fig. 2-8, the i7's logical processors 0 and 1 share an L1/L2 cache, but its logical processors 2 and 3 share a different L1/L2 cache -- whereas all the logical processors share a single L3 cache. Suppose that logical processors 0 and 2 -- which do not share an L1/L2 cache -- write to the same memory location at about the same time, and that the writes go no deeper than L2 for the moment. Could not logical processors 1 and 3 (which are "processors other than those performing the stores") then see the "two stores in an inconsistent order"?
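
In code, the scenario I have in mind would look something like this C++11 sketch (thread and variable names are mine, purely for illustration):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Two writers store different values to the same location, and two
    // readers each sample it twice. "Inconsistent order" would mean one
    // reader seeing 1 then 2 while, in the same run, the other sees 2 then 1.
    std::atomic<int> x{0};

    int main() {
        int r1a, r1b, r3a, r3b;
        std::thread lp0([&] { x.store(1, std::memory_order_relaxed); }); // logical processor 0
        std::thread lp2([&] { x.store(2, std::memory_order_relaxed); }); // logical processor 2
        std::thread lp1([&] { r1a = x.load(std::memory_order_relaxed);
                              r1b = x.load(std::memory_order_relaxed); });
        std::thread lp3([&] { r3a = x.load(std::memory_order_relaxed);
                              r3b = x.load(std::memory_order_relaxed); });
        lp0.join(); lp2.join(); lp1.join(); lp3.join();
        std::printf("lp1 saw %d,%d; lp3 saw %d,%d\n", r1a, r1b, r3a, r3b);
    }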

To achieve consistency, must not logical processors 0 and 2 issue SFENCE instructions, and logical processors 1 and 3 issue LFENCE instructions? Notwithstanding, the Manual seems to think otherwise, and its opinion in the matter does not have the look of a mere misprint. It looks deliberate. I'm confused.

UPDATE

In light of @Benoit's answer, a follow-up question: The only purpose of L1 and L2, therefore, is to speed loads. It is L3 that speeds stores. Is that right?

– thb

3 Answers


Intel CPUs (like all normal SMP systems) use (a variant of) MESI to ensure cache coherency for cached loads/stores, i.e. that all cores see the same view of memory through their caches.

A core can only write to a cache line after doing a Read For Ownership (RFO), getting the line in Exclusive state (no other caches have a valid copy of the line that could satisfy loads). Related: atomic RMW operations prevent other cores from doing anything to the target cache-line by locking it in Modified state for the duration of the operation.
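
For example (my sketch, assuming nothing beyond std::atomic): an atomic RMW in C++ typically compiles on x86 to a lock-prefixed instruction that holds the line for the whole operation:

    #include <atomic>

    std::atomic<long> counter{0};

    // fetch_add is an atomic read-modify-write. On x86 it typically compiles
    // to `lock xadd`: the core RFOs the line into Modified state and keeps it
    // there for the duration, so no other core can slip in between the read
    // and the write halves of the operation.
    long bump() {
        return counter.fetch_add(1);
    }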

To test for this kind of reordering, you need two other threads which both read both stores (in opposite order). Your proposed scenario has one core (reader2) reading an old value from memory (or L3, or its own private L2/L1) after another core (reader1) has read the new value of the same line stored by writer1. This is impossible: for reader1 to see writer1's store, writer1 must have already completed an RFO that invalidates all other copies of the cache line anywhere. And reading directly from DRAM without (effectively) snooping any write-back caches is not allowed. (Wikipedia's MESI article has diagrams.)
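
Sketched in C++ (my rendering of the classic IRIW litmus test, not code from the question; seq_cst is what gives the single-total-order guarantee portably, though on x86 even plain mov stores/loads have it):

    #include <atomic>

    std::atomic<int> X{0}, Y{0};
    int r1, r2, r3, r4;

    void writer1() { X.store(1, std::memory_order_seq_cst); }
    void writer2() { Y.store(1, std::memory_order_seq_cst); }

    // The two readers read both locations, in opposite orders.
    void reader1() { r1 = X.load(std::memory_order_seq_cst);
                     r2 = Y.load(std::memory_order_seq_cst); }
    void reader2() { r3 = Y.load(std::memory_order_seq_cst);
                     r4 = X.load(std::memory_order_seq_cst); }

    // Forbidden outcome: r1==1 && r2==0 && r3==1 && r4==0, i.e. reader1
    // concluded X's store happened first while reader2 concluded Y's did.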

When a store commits (from the store buffer inside a core) to L1d cache, it becomes globally visible to all other cores at the same time. Before that, only the local core could "see" it (via store->load forwarding from the store buffer).

On a system where the only way for data to propagate from one core to another is through the global cache-coherency domain, MESI cache coherency alone guarantees that a single global store order exists that all threads can agree on. x86's strong memory ordering rules make this global store order be some interleaving of program order, and we call this a Total Store Order memory model.

x86's strong memory model disallows LoadLoad reordering, so loads take their data from cache in program order without any barrier instructions in the reader threads.¹
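
For example (my sketch): in C++ it's enough to give the reader's loads acquire ordering, and on x86 acquire loads compile to plain mov instructions with no extra barrier, since the hardware already samples loads in program order (the compiler just has to not reorder them itself):

    #include <atomic>

    extern std::atomic<int> X, Y;

    // On x86 both acquire loads compile to ordinary `mov` instructions:
    // no fence instruction is needed to keep them in program order.
    void reader(int& a, int& b) {
        a = X.load(std::memory_order_acquire);
        b = Y.load(std::memory_order_acquire);
    }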

Loads actually snoop the local store buffer before taking data from the coherent cache. This is the reason the consistent order rule you quoted excludes the case where either store was done by the same core that's doing the loads. See Globally Invisible load instructions for more about where load data really comes from. But when the load addresses don't overlap with any recent stores, what I said above applies: load order is the order of sampling from the shared globally coherent cache domain.


The consistent order rule is a pretty weak requirement. Many non-x86 ISAs don't guarantee it on paper, but very few actual (non-x86) CPU designs have a mechanism by which one core can see store data from another core before it becomes globally visible to all cores. IBM POWER with SMT is one such example: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? explains how forwarding between logical cores within one physical core can cause it. (This is like what you proposed, but within the store buffer rather than L2.)

x86 microarchitectures with HyperThreading (or AMD's SMT in Ryzen) obey that requirement by statically partitioning the store buffer between the logical cores on one physical core (see What will be used for data exchange between threads are executing on one Core with HT?). So even within one physical core, a store has to commit to L1d (and become globally visible) before the other logical core can load the new data.

It's probably simpler to not have forwarding from retired-but-not-committed stores in one logical core to the other logical cores on the same physical core.

(The other requirements of x86's TSO memory model, like loads and stores appearing in program order, are harder. Modern x86 CPUs execute out of order, but use a Memory Order Buffer to maintain the illusion and have stores commit to L1d in program order. Loads can speculatively take values earlier than they're "supposed" to, and then check later. This is why Intel CPUs have "memory-order mis-speculation" pipeline nukes: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?.)

As @BeeOnRope points out, there is an interaction between HT and maintaining the illusion of no LoadLoad reordering: normally a CPU can detect when another core touched a cache line after a load actually read it but before it was architecturally allowed to have read it: the load port can track invalidations to that cache line. But with HT, load ports also have to snoop the stores that the other hyperthread commits to L1d cache, because those stores won't invalidate the line. (Other mechanisms are possible, but it is a problem that CPU designers have to solve if they want high performance for "normal" loads.)


Footnote 1: On a weakly-ordered ISA, you'd use load-ordering barriers to control the order in which the 2 loads in each reader take their data from the globally coherent cache domain.
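
A sketch of what that looks like in portable C++ (my example): an acquire fence between the two loads is the load-ordering barrier; on AArch64 it becomes something like dmb ishld, while on x86 it compiles to no instruction at all:

    #include <atomic>

    extern std::atomic<int> X, Y;

    void reader(int& a, int& b) {
        a = X.load(std::memory_order_relaxed);
        // Load-ordering barrier: keeps the second load from taking its data
        // from the coherent cache domain before the first one does.
        std::atomic_thread_fence(std::memory_order_acquire);
        b = Y.load(std::memory_order_relaxed);
    }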

The writer threads are only doing a single store each, so a fence is meaningless. Because all cores share a single coherent cache domain, fences only need to control local reordering within a core. The store buffer in each core already tries to make stores globally visible as quickly as possible (while respecting the ordering rules of the ISA), so a barrier just makes the CPU wait before doing later operations.

x86 lfence has basically no memory-ordering use cases, and sfence is only useful with NT stores. Only mfence is useful for "normal" stuff: when one thread writes something and then reads another location (http://preshing.com/20120515/memory-reordering-caught-in-the-act/). It blocks StoreLoad reordering and store-forwarding across the barrier.
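
A sketch of that pattern in C++ (mine, modeled on the Preshing example linked above; assume two threads run thread1 and thread2 concurrently): each thread writes one flag and then reads the other, and without a full barrier both threads can read 0 because each load can complete while its own store is still sitting in the store buffer.

    #include <atomic>

    std::atomic<int> A{0}, B{0};

    int thread1() {
        A.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // typically mfence on x86
        return B.load(std::memory_order_relaxed);            // store buffer drained first
    }

    int thread2() {
        B.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // typically mfence on x86
        return A.load(std::memory_order_relaxed);
    }

    // With the fences, at least one of the two threads must return 1.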


In light of @Benoit's answer, a follow-up question: The only purpose of L1 and L2, therefore, is to speed loads. It is L3 that speeds stores. Is that right?

No, L1d and L2 are write-back caches (see Which cache mapping technique is used in intel core i7 processor?). Repeated stores to the same line can be absorbed by L1d.

But Intel uses inclusive L3 caches, so how can L1d in one core have the only copy? L3 is actually tag-inclusive, which is all that's needed for the L3 tags to work as a snoop filter (instead of broadcasting RFO requests to every core). The actual data in dirty lines is private to the per-core inner caches, but L3 knows which core has the current data for a line (and thus where to send a request when one core wants to read a line that another core has in Modified state). Clean cache lines (in Shared state) are data-inclusive of L3, but writing to a cache line doesn't write-through to L3.

– Peter Cordes
  • There is actually a "special case" with hyper-siblings when it comes to memory consistency, which is why you get the memory-order machine clears from the cross-thread snoop hitting the load buffer. I think what happens is that, due to aggressive load ordering, to keep loads appearing in order the MOB will issue an ordering machine clear when some cache line involved in a pending load gets invalidated before it retires (the conditions might be a bit narrower than that, but this is the general idea). In the case of hyper-siblings, however, there is no invalidation ... – BeeOnRope Jun 26 '18 at 03:07
  • ... as the line just stays in L1 in M state. So you need this special cross-thread snoop to avoid breaking the memory ordering in this case (weaker platforms like PPC can take a different approach, which is how you get the IRIW reordering there). – BeeOnRope Jun 26 '18 at 03:08
  • @BeeOnRope: Ah right, that would cause LoadLoad reordering within a hyperthread, which could be a source of IRIW reordering (and thus must be guarded against). PowerPC probably doesn't have to make this case special at all: without barriers it can just let LoadLoad reordering happen. With a barrier, the later load won't try to take its value too early in the first place unless the implementation speculates that the barrier wasn't needed (like x86), but that seems unlikely. – Peter Cordes Jun 26 '18 at 04:46
  • Your answer is formidable. It merits a broader audience than my obscure question can afford it. I appreciate it, at any rate. – thb Jul 02 '18 at 03:09

I believe what the Intel documentation is saying is that the mechanics of the x86 chip will ensure that the other processors always see the writes in a consistent order.

So the other processors will only ever see one of the following results when reading that memory location:

  • value before either write (i.e. the read preceded both writes)

  • value after processor 0's write (i.e. as if processor 2 wrote first, and then processor 0 overwrote)

  • value after processor 2's write (i.e. as if processor 0 wrote first, and then processor 2 overwrote)

It won't be possible for processor 1 to see the value after processor 0's write, but at the same time have processor 3 see the value after processor 2's write (or vice versa).

Keep in mind that since intra-processor re-ordering is allowed (see section 8.2.3.5), processors 0 and 2 may see things differently.

– Chamila Chulatunga
  • Thank you, and +1 for the thoughtfulness of the answer, but when you say that "it won't be possible," why won't it, please? I cannot imagine the mechanism that could make it impossible. Cache snooping could not do it, because there exists nothing to provoke the snoop: there are just ordinary stores to and loads from cache. The caches are multilevel: hence the problem, I think. At any rate, I am not sure that I fully understand your answer (though it is partially illuminating in any case). – thb Jan 09 '13 at 05:08
  • I'll look into it, but maybe the chip has a way of correctly invalidating the caches. Can writes definitely stop at L1/L2 cache levels and not propagate invalidation? (I'm not sure either way, will have to read up a bit more) – Chamila Chulatunga Jan 09 '13 at 05:25
  • You're talking about the case where both writer cores are storing to the *same* location. i.e. this is a (correct) answer to [Will two relaxed writes to the same location in different threads always be seen in the same order by other threads?](https://stackoverflow.com/q/27333311). But Intel's documentation is talking about *any* two stores, regardless of same or different location. They don't need to make the distinction because x86's Total Store Order memory model requires the existence of a single global order. – Peter Cordes Jun 26 '18 at 00:24

Ouch, this is a tough question! But I'll try...

the writes go no deeper than L2

Basically this is impossible since Intel uses inclusive caches. Any data written to L1 will also be present in L2 and L3, unless you prevent caching by disabling the caches through CR0/MTRR.

That being said, I guess there are arbitration mechanisms: processors issue a request to write data, and an arbiter selects which request is granted from among the pending requests in each of the request queues. The selected requests are broadcast to the snoopers, and then to the caches. I suppose this would prevent races, enforcing the consistent order seen by processors other than the one performing the request.

– Benoit
  • I see. I had not understood that. So the only purpose of L1 and L2 is to speed loads. It is L3 that speeds stores. Is that right? This must mean that, in a racked server -- where several physical CPU packages might be mounted in separate sockets on the server's motherboard, each package with its own L3 cache -- the chipset forbids the same non-invalidated cache line from ever being kept in two different L3 caches. That is, when package 0 loads a cache line from main memory, the chipset must first invalidate that cache line on package 1. You're right: this is a tough question. – thb Jan 09 '13 at 15:26
  • 1) Indeed, inclusive caches speed up loads, but I can't see how they would speed up stores. Could you elaborate on this? 2) With a multi-socket board you are talking about a NUMA system, which is quite different from SMP when it comes to memory coherency. I think your question is really restricted to SMP. – Benoit Jan 09 '13 at 15:40
  • Thank you. To answer: 1) You have already elaborated, in your main answer above. I doubt that I could have said it better. 2) Your objection re NUMA answers my question. You are right: though I had not realized it, my question is indeed SMP-specific. Therefore, since the SMP question alone has been hard enough for today, let us defer NUMA questions to another day. (And I'll see if there does not exist some kind of "SMP" tag to add to the original question.) – thb Jan 09 '13 at 15:55
  • If you have some time, you might add your comment to your main answer above. It adds significantly to the answer. – thb Jan 09 '13 at 15:56
  • No, Intel CPUs use a write-back L1d and L2. L3 is only tag-inclusive, if it's inclusive at all. The 2nd paragraph is vague enough to be compatible with MESI (cores get permission to write by taking exclusive ownership of a cache line with an RFO that invalidates all other copies). The implication of write-through to a single shared inclusive cache is totally bogus, though. A multi-socket system doesn't share caches between sockets, only for cores within each socket, but the coherency mechanism isn't fundamentally different. – Peter Cordes Jun 26 '18 at 04:08
  • Your answer remains +1 by me, but a new answer (recently posted) must now be *accepted.* – thb Jul 02 '18 at 03:13