
Summary

Consider the following loop:

loop:
movl   $0x1,(%rax)
add    $0x40,%rax
cmp    %rdx,%rax
jne    loop

where rax is initialized to the address of a buffer that is larger than the L3 cache size. Every iteration performs a store to the next cache line. I expect the number of RFO requests sent from the L1D to the L2 to be more or less equal to the number of cache lines accessed. The problem is that this seems to be the case only when I count kernel-mode events, even though the program runs in user mode, except in one case as I discuss below. The way the buffer is allocated does not seem to matter (.bss, .data, or from the heap).

Details

The results of my experiments are shown in the tables below. All of the experiments are performed on processors with hyperthreading disabled and all hardware prefetchers enabled.

I've tested the following three cases (a C sketch of all three appears after the list):

  • There is no initialization loop. That is, the buffer is not accessed before the "main" loop shown above. I'll refer to this case as NoInit. There is only one loop in this case.
  • The buffer is first accessed using one load instruction per cache line. Once all the lines are touched, the main loop is then executed. I'll refer to this case as LoadInit. There are two loops in this case.
  • The buffer is first accessed using one store instruction per cache line. Once all the lines are touched, the main loop is then executed. I'll refer to this case as StoreInit. There are two loops in this case.
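
For concreteness, here is a C sketch of the three cases; the 64 MiB size, the mmap allocation, and the preprocessor switches are purely illustrative, and the measured loops are the assembly shown above:

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>

#define BUF_SIZE  (64UL * 1024 * 1024)   /* larger than the L3 cache      */
#define LINE_SIZE 64

int main(void)
{
    /* Anonymous mapping; .bss, .data, or the heap behave the same way here. */
    volatile char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

#if defined(LOAD_INIT)                   /* LoadInit: one load per line   */
    volatile char sink = 0;
    for (uint64_t i = 0; i < BUF_SIZE; i += LINE_SIZE)
        sink += buf[i];
#elif defined(STORE_INIT)                /* StoreInit: one store per line */
    for (uint64_t i = 0; i < BUF_SIZE; i += LINE_SIZE)
        buf[i] = 1;
#endif                                   /* NoInit: no init loop at all   */

    /* Main loop: one store per cache line (the asm loop shown above). */
    for (uint64_t i = 0; i < BUF_SIZE; i += LINE_SIZE)
        *(volatile int *)(buf + i) = 1;

    return 0;
}

Building with -DLOAD_INIT or -DSTORE_INIT selects the LoadInit or StoreInit case; leaving both undefined gives NoInit.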

The following table shows the results on an Intel CFL processor. These experiments have been performed on Linux kernel version 4.4.0.

[Table image: results on the CFL processor]

The following table shows the results on an Intel HSW processor. Note that the events L2_RQSTS.PF_HIT, L2_RQSTS.PF_MISS, and OFFCORE_REQUESTS.ALL_REQUESTS are not documented for HSW. These experiments have been performed on Linux kernel version 4.15.

[Table image: results on the HSW processor]

The first column of each table contains the names of the performance monitoring events whose counts are shown in the other columns. In the column labels, the letters U and K represent user-mode and kernel-mode events, respectively. For the cases that have two loops, the numbers 1 and 2 are used to refer to the initialization loop and the main loop, respectively. For example, LoadInit-1K represents the kernel-mode counts for the initialization loop of the LoadInit case.
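
For reference, the U and K counts come from per-ring filtering of the counters. Below is a sketch of one way to do that with perf_event_open and the exclude_kernel/exclude_user bits; the raw encoding (event 0x24, umask 0xE2) is what I believe corresponds to L2_RQSTS.ALL_RFO on these processors, and this is not my exact harness:

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open one counter for L2_RQSTS.ALL_RFO (event 0x24, umask 0xE2 on these
   parts -- verify against the event tables for your exact model), counting
   either user-mode-only or kernel-mode-only events. */
static int open_rfo_counter(int user_only)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0xE224;              /* (umask << 8) | event */
    attr.disabled = 1;
    attr.exclude_kernel = user_only;   /* drop ring-0 counts   */
    attr.exclude_user = !user_only;    /* drop ring-3 counts   */
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd = open_rfo_counter(1);      /* 1 = count user-mode events */
    if (fd < 0)
        return 1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the loop under test here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("L2_RQSTS.ALL_RFO: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}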

The values shown in the tables are normalized by the number of cache lines. They are also color-coded: the darker the green, the larger the value relative to all other cells in the same table. However, the last three rows of the CFL table and the last two rows of the HSW table are not color-coded because some of the values in these rows are too large. These rows are painted in dark gray to indicate that they are not color-coded like the other rows.

I expect the number of user-mode L2_RQSTS.ALL_RFO events to be equal to the number of cache lines accessed (i.e., a normalized value of 1). This event is described in the manual as follows:

Counts the total number of RFO (read for ownership) requests to L2 cache. L2 RFO requests include both L1D demand RFO misses as well as L1D RFO prefetches.

It says that L2_RQSTS.ALL_RFO may not only count demand RFO requests from the L1D but also L1D RFO prefetches. However, I've observed that the event count is not affected by whether the L1D prefetchers are enabled or disabled, on both processors. But even if the L1D prefetchers did generate RFO prefetches, the event count should then be at least as large as the number of cache lines accessed. As can be seen from both tables, this is only the case in StoreInit-2U. The same observation applies to all of the events shown in the tables.
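
For reference, one way to flip the prefetchers on and off is through MSR 0x1A4 via the msr driver, as in the sketch below; the bit layout is the one Intel has documented for these processor families (verify it for your model), it needs root and the msr module loaded, and it is not necessarily exactly how I did it:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* MSR 0x1A4: bit 0 = L2 hardware (streamer) prefetcher, bit 1 = L2 adjacent
   cache line prefetcher, bit 2 = DCU (L1D) streamer prefetcher, bit 3 = DCU IP
   prefetcher. A bit value of 1 disables that prefetcher. */
#define MSR_PREFETCH_CONTROL 0x1a4

/* Read-modify-write the low four bits of MSR 0x1A4 on one logical CPU. */
static int set_prefetcher_disable_bits(int cpu, uint64_t bits)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    uint64_t val;
    if (pread(fd, &val, sizeof(val), MSR_PREFETCH_CONTROL) != sizeof(val)) {
        close(fd);
        return -1;
    }
    val = (val & ~0xfULL) | (bits & 0xf);
    int ok = pwrite(fd, &val, sizeof(val), MSR_PREFETCH_CONTROL) == sizeof(val);
    close(fd);
    return ok ? 0 : -1;
}

int main(void)
{
    /* Example: disable only the two L1D (DCU) prefetchers on CPU 0. */
    return set_prefetcher_disable_bits(0, 0x4 | 0x8) ? 1 : 0;
}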

However, the kernel-mode counts of the events are about equal to what the user-mode counts are expected to be. This is in contrast to, for example, MEM_INST_RETIRED.ALL_STORES (or MEM_UOPS_RETIRED.ALL_STORES on HSW), which works as expected.

Due to the limited number of PMU counter registers, I had to divide all the experiments into four parts. In particular, the kernel-mode counts are produced from different runs than the user-mode counts. It doesn't really matter which events are counted together in the same run. I mention this because it explains why some user-mode counts are a little larger than the kernel-mode counts of the same events.

The events shown in dark gray seem to overcount. The 4th-gen and 8th-gen Intel processor specification updates do mention (errata HSD61 and 111, respectively) that OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO may overcount. But these results indicate that it may overcount by many times, not just by a couple of events.

There are other interesting observations, but they are not pertinent to the question, which is: why are the RFO counts not as expected?

Hadi Brais
  • Doesn't Linux implement COW by allocating + zeroing a page on demand? (on the first *write*). So after returning to user-space after a store #PF, the whole page is hot in L1d when the store instruction re-runs. – Peter Cordes Mar 05 '19 at 03:19
  • Are the K columns kernel only, or kernel + user? – BeeOnRope Mar 05 '19 at 07:04

1 Answer


You didn't tag your OS, but let's assume you are using Linux. This stuff would be different on another OS (and perhaps even between variants of the same OS).

On a read access to an unmapped page, the kernel page fault handler maps in a system-wide shared zero page, with read-only permissions.

This explains columns LoadInit-1U|K: even though your init load is striding over a virtual area of 64 MB performing loads, only a single physical 4K page filled with zeros is mapped, so you get approximately zero cache misses after the first 4KB, which rounds to zero after your normalization.1

On a write access to an unmapped page, or to the read-only shared zero page, the kernel will map a new, unique page on behalf of the process. This new page is guaranteed to be zeroed, so unless the kernel has some known-to-be-zero pages hanging around, this involves zeroing the page (effectively memset(new_page, 0, 4096)) prior to mapping it.
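
A quick way to see both behaviors from user space is to watch resident memory (a sketch; the 64 MiB size is arbitrary): RSS barely grows while you only read the untouched mapping, because every page is backed by the shared zero page, but it grows page by page once you start writing:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define SIZE (64UL * 1024 * 1024)

static long peak_rss_kib(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_maxrss;             /* peak resident set size, in KiB on Linux */
}

int main(void)
{
    volatile char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    volatile char sink = 0;
    for (uint64_t i = 0; i < SIZE; i += 4096)
        sink += buf[i];              /* read faults: all map the shared zero page */
    printf("after loads:  ~%ld KiB peak RSS\n", peak_rss_kib());

    for (uint64_t i = 0; i < SIZE; i += 4096)
        buf[i] = 1;                  /* write faults: a fresh zeroed page each     */
    printf("after stores: ~%ld KiB peak RSS\n", peak_rss_kib());
    return 0;
}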

That largely explains the remaining columns except for StoreInit-2U|K. In those cases, even though it seems like the user program is doing all the stores, the kernel ends up doing all of the hard work (except for one store per page) since as the user process faults in each page, the kernel writes zeros to it, which has the side effect of bringing all the pages into the L1 cache. When the fault handler returns, the triggering store and all subsequent stores for that page will hit in the L1 cache.

It still doesn't fully explain StoreInit-2. As clarified in the comments, the K column actually includes the user counts, which explains that column (subtracting out the user counts leaves it at roughly zero for every event, as expected). The remaining confusion is why L2_RQSTS.ALL_RFO is not 1 but some smaller value like 0.53 or 0.68. Maybe the event is undercounting, or there is some micro-architectural effect that we're missing, like a type of prefetch that prevents the RFO (for example, if the line is loaded into the L1 by some type of load operation before the store, the RFO won't occur). You could try to include the other L2_RQSTS events to see if the missing events show up there.

Variations

It doesn't need to be like that on all systems. Certainly other OSes may have different strategies, but even Linux on x86 might behave differently based on various factors.

For example, rather than the 4K zero page, you might get allocated a 2 MiB huge zero page. That would change the benchmark since 2 MiB doesn't fit in L1, so the LoadInit tests will probably show misses in user-space on the first and second loops.

More generally, if you were using huge pages, the page fault granularity would be changed from 4 KiB to 2 MiB, meaning that only a small part of the zeroed page would remain in L1 and L2, so you'd get L1 and L2 misses, as you expected. If your kernel ever implements fault-around for anonymous mappings (or whatever mapping you are using), it could have a similar effect.
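
If you want to provoke the huge-page variation deliberately, one option is to ask for transparent huge pages explicitly, as in this sketch; whether you actually get 2 MiB pages depends on the kernel's THP configuration, and MAP_HUGETLB with pre-reserved hugepages is the other route:

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>

#define SIZE  (64UL * 1024 * 1024)
#define HUGE  (2UL * 1024 * 1024)

int main(void)
{
    /* Over-allocate so we can hand a 2 MiB-aligned region to madvise. */
    char *raw = mmap(NULL, SIZE + HUGE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *buf = (char *)(((uintptr_t)raw + HUGE - 1) & ~(uintptr_t)(HUGE - 1));

    /* Request transparent huge pages for the range: page faults may then be
       served 2 MiB at a time instead of 4 KiB. */
    madvise(buf, SIZE, MADV_HUGEPAGE);

    for (uint64_t i = 0; i < SIZE; i += 64)
        *(volatile int *)(buf + i) = 1;   /* the same one-store-per-line loop */
    return 0;
}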

Another possibility is that the kernel may zero pages in the background and so have zero pages ready. This would remove the K counts from the tests, since the zeroing doesn't happen during the page fault, and would probably add the expected misses to the user counts. I'm not sure if the Linux kernel ever did this or has the option to do it, but there were patches floating around. Other OSes like BSD have done it.

RFO Prefetchers

About "RFO prefetchers" - the RFO prefetchers are not really prefetchers in the usual sense and they are unrelated to the L1D prefetchers can be turned off. As far as I know "RFO prefetching" from the L1D simply refers to sending an RFO request either for (a) a store when its address is calculated (i.e., when the store data uop executes), but before it retires or (b) for stores in the store buffer which are nearing but have not reached the head of the store buffer.

Obviously when a store gets to the head of the buffer, it's time to send an RFO, and you wouldn't call that a prefetch - but why not send some requests for the second-from-the-head store too, and so on (case b)? Or why not check the L1D as soon as the store address is known (as a load would) and then issue a speculative RFO prefetch if it misses? These may be known as RFO prefetches, but they differ from a normal prefetch in that the core knows the address that has been requested: it is not a guess.

There is speculation in the sense that getting additional lines other than the current head may be wasted work if another core sends an RFO for that line before the core has a chance to write to it: the request was useless in that case and just increased coherency traffic. So there are predictors that may reduce this store buffer prefetch if it fails too often. There may also be speculation in the sense that store buffer prefetch may send requests for junior stores which haven't retired, at the cost of a useless request if the store ends up being on a bad path. I'm not actually sure if current implementations do that.


1 This behavior actually depends on the details of the L1 cache: current Intel VIPT implementations allow multiple virtual aliases of the same single line to all live happily in L1. Current AMD Zen implementations use a different approach (micro-tags) which doesn't allow the L1 to logically contain multiple virtual aliases, so I would expect Zen to miss to L2 in this case.

BeeOnRope
  • I think you're suggesting that the `0.01` columns for HSW are for the user-space store that triggered the page fault (1 line per page). But those are in rows for counters like L1D_REPLACEMENT and L2_RQSTS_ALL_RFO. A page-fault store isn't going to evict anything from L1d, and certainly not trigger an RFO when there's no physical address (the noinit and storeinit-1U cases are doing stores to hardware-unmapped virtual pages, not read-only-mapped). Possibly there's an effect there from page-walks in user-space fetching through L1d (I think). Or else it's noise, because we don't see it in CFL – Peter Cordes Mar 05 '19 at 06:09
  • @PeterCordes Good point, I remember just thinking well there should be 1 missing store out of 64 in user space and scrolled up and sure enough there was the 0.01 but as you point out it doesn't obviously come from that store. It's probably just any old bit of noise, a context switch, etc. – BeeOnRope Mar 05 '19 at 06:57
  • Aha! I will do additional experiments to find an explanation for the `StoreInit-2` event counts. – Hadi Brais Mar 05 '19 at 17:23
  • My understanding is that all Debian-based kernels zero pages on demand. But Windows for example has a background thread that zeroes reclaimed pages and puts back into a pool for later allocations. – Hadi Brais Mar 05 '19 at 17:24
  • @PeterCordes and Bee, the 0.01% user-mode event counts have a large standard deviation (at least 5% depending on the event). I think these are from the code that programs and reads the performance counters. – Hadi Brais Mar 05 '19 at 17:36
  • @HadiBrais - are the K columns kernel + user, or kernel only? Your description seems to indicate kernel only, but then I'm having trouble parsing the note about "this explains why some user-mode counts are a little larger than the kernel-mode counts of the same events" which would seem to imply the kernel counts would include the user counts (i.e., that U <= K). My explanation is assuming K is kernel only, and would change if K is k + u. – BeeOnRope Mar 05 '19 at 18:13
  • @HadiBrais - I see: can you show any counter value where the user mode value >> kernel mode value (even for a different test or counter like plain ALU stuff you _know_ doesn't invoke the kernel)? It is very suspicious to me that K >= U (roughly) in every column, and also in the cases the K count is expected to be zero it is instead almost exactly the same as the U count. Looks like a bug to me that causes the U counts to be included in K. – BeeOnRope Mar 05 '19 at 18:25
  • No I was wrong. Additional testing shows that the K columns do actually count both user-mode and kernel-mode events. Your suspicion (and my earlier suspicion) is correct. This explains `StoreInit-2`. I've also tested with `mmap(MAP_POPULATE)` and sure enough the `L2_RQSTS.ALL_RFO` normalized counts for `StoreInit-1U` and `StoreInit-2U` are both 1. I think this fully answers the question now. Thanks. You can edit that paragraph where you ask questions about `StoreInit-2`. BTW, `LoadInit1-U|K` and `StoreInit2-U|K` should be `LoadInit-1U|K` and `StoreInit-2U|K`, respectively. – Hadi Brais Mar 05 '19 at 18:33
@HadiBrais - OK, it makes more sense. One thing I still don't understand though is why `StoreInit-2U` `ALL_RFO` isn't 1, but 0.56 and 0.68 on the two systems? Maybe the event definition [is bugged](https://github.com/travisdowns/uarch-bench/wiki/perf-stat-recipies#l2_rqsts) - although in that page I find that while `HIT` is bugged `ALL` isn't. So maybe there is a real micro-architectural effect there. Does it change with PF off? – BeeOnRope Mar 05 '19 at 18:39
  • Yes. I've observed that the L2 streamer prefetcher seems to prefetch lines into the L1 because when I switch this particular prefetcher off, the `ALL_RFO` count does become 1. This is the case on both HSW and CFL. – Hadi Brais Mar 05 '19 at 18:46
The description of the `L2_RQSTS.ALL_RFO` event in the manual is ambiguous because it uses the phrase `L1D RFO prefetches`, which suggests that one or both L1 prefetchers, starting with Skylake, can prefetch for stores. If it used the phrase `RFO prefetches` instead (that is, without "L1D") then I think it would be less ambiguous. I'm still not totally confident that the L1D prefetchers cannot prefetch for stores starting with Skylake. Perhaps they are not aggressive, which explains why they are not being observed using the perf event counts. Dedicated tests are needed to prove that 100%. – Hadi Brais Mar 05 '19 at 18:57
  • It makes sense that the L2 streamer prefetcher may prefetch for stores into the L1 if the L1 prefetchers cannot prefetch for stores, so I think this is very plausible. – Hadi Brais Mar 05 '19 at 19:03
  • @HadiBrais Do you mean the L2 prefetcher would not only get lines from the outer cache levels, but also then "push" lines up into the L1? Interesting idea although I have never heard of it. Stores don't need prefetching like loads do since they are buffered anyways and have no "output", so prefetching can happen simply by examining the addresses in the store buffer. – BeeOnRope Mar 05 '19 at 19:22
  • Yes. That would explain the `ALL_RFO` counts. Note that cacheable writeback stores may benefit from prefetching similar to loads, it's just that it's not that important for stores because of buffering in the store buffer. So it makes sense that there is only one prefetcher that can prefetch for stores and that it can prefetch into the L1, L2, and L3. I don't think that the `ALL_RFO` event undercounts. – Hadi Brais Mar 05 '19 at 19:28
  • In contrast, there is no need for an L2 prefetcher to prefetch for loads into the L1 because the L1 prefetchers can already do that. – Hadi Brais Mar 05 '19 at 19:30
  • I don't find the explanation compelling. Yes, it is one explanation, but inventing a totally new, undocumented, prefetcher that goes in the "reverse" direction seems like an unlikely one. It would be hard for the L2 to prefetch for L1, because it doesn't see the access stream: it only sees misses: at first it will work, then it will stop working since L1 is getting hits, then stop, etc. That's why one reason existing prefetchers fetch downwards, not upwards. If they wanted this behavior, why not implement it in the L1 like the other prefetchers? – BeeOnRope Mar 05 '19 at 19:32
  • I think the answer is that there is already a very effective "RFO prefetcher" - and it examines the 40+ entry store buffer for upcoming stores to prefetch. There is no need to predict anything since you already have the actual store addresses in the store buffer. It is unlike loads since stores only become a bottleneck if the store buffer fills up, and at that point RFO prefetching is guaranteed to have all the store addresses it needs to work, so there is almost zero need for address-predictive prefetching for stores, unlike loads. This is described in Intel patents, too. – BeeOnRope Mar 05 '19 at 19:34
Store prefetching is useful when there is a long stream of stores that miss in the L1. Instead of waiting until the addresses are calculated and waiting for available LFBs to send the RFO requests, a hardware prefetcher can potentially make better use of the larger parallelism available at the L2 to prefetch into the L2. Also it may make better use of the L1-L2 bus (higher utilization) by prefetching into the L1 to avoid allocating LFBs for RFOs in the first place, making it less likely for the store buffer to become full. – Hadi Brais Mar 05 '19 at 19:40
The first makes sense and definitely happens: the L2 prefetcher does prefetch store requests _downwards_ to the L3. So the higher parallelism outwards is achieved. We are discussing your proposed mechanism where a prefetcher located _in_ L2 also somehow pushes the lines up into L1. How can this save LFB? Lines go in and out of the L1 through the LFBs (and perhaps write-back buffers for evicted dirty lines), so are you also proposing that the L2 prefetcher has a special path to insert lines into L1 that isn't the LFB? Why not just add more LFBs? – BeeOnRope Mar 05 '19 at 19:44
  • There may be no need for a dedicated path between the L1 and L2 to push lines into the L1. The same L1-L2 bus can be used. When the L2 pushes a line into the L1, the L1 can either fill it in if there is a port available in that cycle or discard it otherwise. Adding more LFBs requires more area. – Hadi Brais Mar 05 '19 at 19:48
  • When there is a long stream of stores that miss in L1, _the L1 itself_ prefetches those pending stores into L1 by examining the store buffer: this is the so-called "RFO prefetch". The scenario where an additional "address predicting" store prefetcher is needed are basically nil (if you flesh out most realistic scenarios enough you end up figuring out that store buffering + RFO prefetching solves it), and if it was needed it would go in the L1 not the L2. – BeeOnRope Mar 05 '19 at 19:49
  • If the L2 prefetcher is using the same LFBs as everything else, then we are back where we started: there is almost no benefit (the analogy with load prefetching is very weak), and if it did exist it would just be in the L1 like all the other prefetchers ever, more or less. – BeeOnRope Mar 05 '19 at 19:50
  • I think the prefetching logic for tracking and predicting accesses is different for loads and stores. Such logic already exists in the L2 prefetcher. Adding it to the L1 prefetchers incurs hardware overhead. – Hadi Brais Mar 05 '19 at 19:52
I agree it's unusual for the L2 prefetcher to prefetch into the L1. But if Intel found a way to do that which improves perf by, say, 1% on average for specific types of benchmarks with basically no hardware overhead, then the benefit is compelling. – Hadi Brais Mar 05 '19 at 19:55
  • Sorry, we are just going to have to agree to disagree on this one. If you invent a new hardware mechanism that explains the specific deviation every time a PMC doesn't match the expected value, without any support from existing knowledge or patents (Intel seems to patent everything), and especially one inconsistent with existing practices and mechanisms - you are going to end up with a very "interesting" CPU :). – BeeOnRope Mar 05 '19 at 20:22
In no particular order, the simplest explanations are: (1) The test doesn't do exactly what you thought it did (2) The counters don't count exactly what you thought they did (or are badly defined) (3) an existing known hardware mechanism or behavior that you hadn't considered accounts for the difference. Less likely, but still more likely than the discovery of a totally new hardware mechanism, is stuff like misbehaving counters. – BeeOnRope Mar 05 '19 at 20:24
  • One plausible candidate that I can think of is if an RFO request from L1 reaches L2 while the line is not yet in L2 but is in the process of being fetched, it counts neither as a miss nor a hit but in another category, analogously with the L1 events where "fb bit" is a third category separate from "L1 hit" and "L1 miss". – BeeOnRope Mar 05 '19 at 21:26
You hypothesize 2 times when it could make sense to send an RFO: at store-address exec, and when nearing commit. Another possible time might be at retirement: you then know it's non-speculative. I guess you could even consider sending an RFO at any of those points, with decreasing thresholds on how many free LFBs you need before triggering. (e.g. if they're almost all free, go for it at store-address exec.) You'd need a bit in each SB entry to track whether a request had already been sent. (Or since there aren't a power-of-2 number of LFB, just a valid vs. invalid LFB-number.) – Peter Cordes Feb 28 '21 at 07:15
  • I guess "at retirement" is fairly similar to "near the head of the senior store buffer" if you imagine that the latter option is looking at the N stores near the head of the senior store buffer: any time the number of stores in the senior buffer is <= N this boils down to "at retirement" since a store becomes eligible as soon as becomes senior. In the case that stores are starting to queue up due to a lot of misses, they start to diverge: I guess "at retirement" looks worse here since it is working at the wrong end of \ – BeeOnRope Feb 28 '21 at 07:28
  • the queue of pending stores (should prioritize near to commit, right?) and also because I guess it has to make a Y/N decision for each store and in practice this would result in kind of random assignment of prefetches depending on the LFB threshold when the store retired, compared to the end-of-queue approach which would try to constantly keep the oldest N stores in flight. – BeeOnRope Feb 28 '21 at 07:30
  • I am curious how this all works but not yet curious enough to try to test it. What I do know is that there definitely is some sort of prefetch, based on the performance of random independent store misses: they get an MLP of close to 10 on SKL, indicating that "almost all" of the LFBs can be used by this approach if the conditions are right. – BeeOnRope Feb 28 '21 at 07:31
  • @Peter - in the literature the two suggested times seem to be at exec and at retire (your suggestion). So maybe my "when nearing head of store buffer" is just nonsense (after all, you want to start it early). – BeeOnRope Mar 03 '21 at 23:58