
On Intel x86, Linux uses the event l1d.replacement to implement its generic L1-dcache-load-misses event.

This event is defined as follows:

Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.

Perhaps naively, I would have expected perf to use something like mem_load_retired.l1_miss, which supports PEBS and is defined as:

Counts retired load instructions with at least one uop that missed in the L1 cache. (Supports PEBS)

The values of the two events are usually not very close, and sometimes they differ wildly. For example:

$ ocperf stat -e mem_inst_retired.all_loads,l1d.replacement,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired.fb_hit head -c100M /dev/urandom > /dev/null 

 Performance counter stats for 'head -c100M /dev/urandom':

       445,662,315      mem_inst_retired_all_loads                                   
            92,968      l1d_replacement                                             
       443,864,439      mem_load_retired_l1_hit                                     
         1,694,671      mem_load_retired_l1_miss                                    
            28,080      mem_load_retired_fb_hit                                     

There are more than 17 times as many "L1 misses" as measured by mem_load_retired.l1_miss compared to l1d.replacement. Conversely, you can also find examples where l1d.replacement is much higher than the mem_load_retired counters.

What exactly is l1d.replacement measuring, why was it chosen in the kernel, and is it a better proxy for L1 d-cache misses than mem_load_retired.l1_miss?

BeeOnRope
  • `l1d.replacements` measures *lines* that miss, I'd assume, instead of instructions that miss. So there's some sense in that. (The name implies it measures evictions or allocations in L1d.) But that would also measure store misses, which `L1-dcache-load-misses` claims not to. Yuck. Looks like yet another reason not to trust those generic event names, along with [how to interpret perf iTLB-loads,iTLB-load-misses](https://stackoverflow.com/q/49933319). – Peter Cordes Sep 04 '18 at 20:54
  • @PeterCordes - but the `mem_load_retired` also makes that distinction by breaking L1 load accesses into three categories: `l1_hit`, `l1_miss` and `fb_hit`. So you should only get one `l1_miss` per missed line, more or less, and the rest would be `fb_hit`. Although maybe `fb_hit` isn't working as I think - because if it does I can't reconcile the numbers above. – BeeOnRope Sep 04 '18 at 21:25
  • Hmm, can a load miss in L1 and *then* hit in a fill-buffer instead of initiating a new line fill? I haven't played with those events. – Peter Cordes Sep 04 '18 at 21:29
  • 2
    @PeterCordes - definitely! The fill buffer would be quite terrible if it didn't work that way. The basic idea is when you miss in L1, the next place you look is the fill buffers and if the line you missed in is already in a FB you just sleep the load, since you don't want to allocated a redundant FB. This behavior is pretty critical since in a normal linear access of say `DWORD`s you'd only get one true L1 miss for the first `DWORD` and then 15 more `l1-miss-but-hit-FB` for the next 15 accesses to the same line, and you wouldn't want to fill up all your FBs. – BeeOnRope Sep 04 '18 at 21:32
  • Right, I already expected the hardware to work that way, but I mean the `fb_hit` event might not be exclusive with the `l1_miss` event. So a load instruction generates an `l1_miss` if it isn't satisfied on the fast path, and *also* an `fb_hit` event if that happens. Does that fit the data? Very few of your l1d misses are to the same line? `l1d_replacement` seems very low, though, for that many `l1_miss` with few of the misses being `fb_hit`s. Does store-forwarding count as a `l1_miss`? – Peter Cordes Sep 04 '18 at 21:36
  • 1
    I don't think the `l1_miss` case is inclusive of `fb_hit` because in many cases the `fb_hit` count ends up higher, e.g., `ocperf stat -e mem_inst_retired.all_loads,l1d.replacement,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired_fb_hit true`. Note that in most workloads the `replacement` value isn't an order of magnitude different, but this was just an interesting one where it is. Good question about store-forwarding. – BeeOnRope Sep 04 '18 at 21:55
  • Might want to update your question with some other outputs that rule out `l1_miss` being inclusive of `fb_hit`, then, to narrow down the space for guesswork, or at least mention it. (I guess you're really asking for an authoritative answer, but still we're often inclined to guess.) – Peter Cordes Sep 04 '18 at 22:01
  • @PeterCordes another weird thing is the `l1_hit|miss|fb` counts don't add up exactly to the `inst_retired.all_loads` value. Also the language for the "inst retired" events talks about "any uop from the instruction", so I guess some instructions that do two memory loads could increment two counters but only increment the inst counter by 1 (but the observed counting problem is in the opposite direction). – BeeOnRope Sep 04 '18 at 23:59
  • With PEBS I guess we should expect the total to be *very* close, if they were all exclusive and covered every possibility? Any chance that context-switching or handling of perf interrupts could account for the `75125` discrepancy? perf would collect all the PEBS data at once if an interrupt triggered, though, right? Rather than accumulating `l1_miss` events while collecting the `l1_hit` events? If you're right that there's a real discrepancy, then maybe store-forwarding? Re: multiple accesses per instruction: that's rare unless cache-line splits count. `cmps`, gather, maybe memory-dst `adc`? – Peter Cordes Sep 05 '18 at 00:35
  • @Peter - yeah, it seems like more than the normal "non-atomic reads" issue with perf counters, and I feel like perf stat should be mostly immune to that anyway when you are reading the counters for the lifetime of the application. Good point about PEBS, I'm not totally sure how it works. When you have multiple PEBS events I guess they all go to the same buffer? Perf also has this distinction between events that can use "large PEBS" (a buffer holding more than one sample) and those where they just use size-1 buffers, but AFAICT it's hard to tell which is being used. – BeeOnRope Sep 05 '18 at 00:40
  • Can you construct a test-case with more concurrent misses to the same cache line, and fewer hits? So counts are more evenly distributed between the three events, with no chance for `l1_fb_hit` to be lost in the noise. Maybe randomly select a (normally cold in L1D) cache line, then do 4 dword loads from it? If we can predict what the HW is probably doing, we might divine what the counters mean. – Peter Cordes Sep 05 '18 at 00:43

0 Answers