
Consider the following loop:

.loop:
    add     rsi, OFFSET
    mov     eax, dword [rsi]
    dec     ebp
    jg      .loop

where OFFSET is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code, and the buffer is not initialized or touched before the loop runs. Presumably, on Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. Therefore, the only limit on the buffer size is the number of virtual pages, so we can easily experiment with very large buffers.

The loop consists of 4 instructions. Each instruction is decoded into a single uop in the fused and unfused domains on Haswell. There is also a loop-carried dependency between successive instances of add rsi, OFFSET. Therefore, under idle conditions where the load always hits in the L1D, the loop should execute at about 1 cycle per iteration. For small offsets (strides), this is expected thanks to the IP-based L1 streaming prefetcher and the L2 streaming prefetcher. However, both prefetchers can only prefetch within a 4K page, and the maximum stride supported by the L1 prefetcher is 2K. So for small strides, there should be about 1 L1 miss per 4K page. As the stride increases, the total number of L1 misses and TLB misses will increase, and performance will deteriorate accordingly.
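As a rough sketch of the expectation above (my own model, not a measurement): with one L1 miss per 4K page for small strides, the per-iteration miss rate is just the fraction of a page that one stride covers.

```python
PAGE_SIZE = 4096  # bytes

def expected_l1_misses_per_iteration(stride):
    """One L1 miss each time the loop crosses into a new 4K page;
    for stride >= 4096 every iteration touches a new page."""
    return min(stride, PAGE_SIZE) / PAGE_SIZE
```

For example, at stride 64 this predicts 64/4096 ≈ 0.016 misses per iteration.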

The following graph shows various interesting performance counters (per iteration) for strides between 0 and 128. Note that the number of iterations is constant for all experiments. Only the buffer size changes to accommodate the specified stride. In addition, only user-mode performance events are counted.

[Graph: performance counters per iteration for strides 0 to 128]

The only weird thing here is that the number of retired uops increases with the stride. It goes from 3 uops per iteration (as expected) to 11 for stride 128. Why is that?

Things only get weirder with larger strides, as the following graph shows. In this graph, the strides range from 32 to 8192 in 32-byte increments. First, the number of retired instructions increases linearly from 4 to 5 up to stride 4096 bytes, after which it remains constant. The number of load uops increases from 1 to 3, while the number of L1D load hits remains 1 per iteration. Only the number of L1D load misses makes sense to me for all strides.

[Graph: performance counters per iteration for strides 32 to 8192]

The two obvious effects of larger strides are:

  • The execution time increases and so more hardware interrupts will occur. However, I'm counting user-mode events, so interrupts should not interfere with my measurements. I've also repeated all experiments with taskset or nice and got the same results.
  • The number of page walks and page faults increases. (I've verified this but I'll omit the graphs for brevity.) Page faults are handled by the kernel in kernel mode. According to this answer, page walks are implemented using dedicated hardware (on Haswell?), although the link that the answer is based on is dead.

To investigate further, the following graph shows the number of uops from microcode assists. The number of microcode assist uops per iteration increases until it reaches the maximum value at stride 4096, just like with the other performance events. The number of microcode assist uops per 4K virtual page is 506 for all strides. The "Extra UOPS" line plots the number of retired uops minus 3 (the expected number of uops per iteration).

[Graph: microcode assist uops per iteration and extra retired uops per iteration]

The graph shows that the number of extra uops is slightly larger than half of the number of microcode assist uops for all strides. I don't know what this means, but it could be related to page walks and could be the reason for the observed perturbation.
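The per-iteration assist curve can be stated numerically (my own sketch, assuming the 506 assist uops per 4K page measured above are spread evenly over the iterations that share a page):

```python
ASSIST_UOPS_PER_PAGE = 506  # measured value quoted in the question

def assist_uops_per_iteration(stride):
    """Microcode-assist uops per iteration: one page's worth of assists
    amortized over the 4096/stride iterations that touch that page."""
    return ASSIST_UOPS_PER_PAGE * min(stride, 4096) / 4096
```

This ramps linearly with stride and saturates at 506 per iteration at stride 4096, matching the shape of the graph.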

Why are the numbers of retired instructions and uops per iteration increasing for larger strides even though the number of static instructions per iteration is the same? Where is the interference coming from?


The following graphs plot the number of cycles per iteration against the number of retired uops per iteration for different strides. The number of cycles increases much more quickly than the number of retired uops. By using linear regression, I found:

cycles = 0.1773 * stride + 0.8521
uops = 0.0672 * stride + 2.9277

Taking the derivatives of both functions:

d(cycles)/d(stride) = 0.1773
d(uops)/d(stride) = 0.0672

This means that the number of cycles increases by 0.1773 and the number of retired uops increases by 0.0672 with each 1-byte increment in stride. If interrupts and page faults were indeed the (only) cause of perturbation, shouldn't both rates be very close?
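To make the comparison concrete, the ratio of the two fitted slopes gives the apparent cost in cycles per extra retired uop:

```python
# Fitted slopes from the linear regression above.
d_cycles = 0.1773  # extra cycles per byte of stride
d_uops = 0.0672    # extra retired uops per byte of stride

ratio = d_cycles / d_uops  # about 2.64 cycles per extra retired uop
```

That ratio of roughly 2.6 is the number the accepted answer discusses below.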

[Graph: cycles per iteration vs. stride, with linear fit]

[Graph: retired uops per iteration vs. stride, with linear fit]

Hadi Brais
  • Yes, page walks use dedicated hardware since P6, not microcoded uops. @Bee says L1 misses "cost" an extra uop executed, apparently they get replayed or something. [AVX 512 improvements?](https://stackoverflow.com/posts/comments/91993393). – Peter Cordes Sep 26 '18 at 23:46
  • About the replays: for every level of the cache you miss, it seems there is one more p23 uop. I.e., a hit in L1 is 1 uop; a hit in L2, 2 uops; a hit in L3, 3 uops (maybe that's where it stops). I think maybe what happens is that the scheduler is always optimistic: it doesn't know what level of the cache you'll hit in, so at every chance it wakes up the dependent operation at the time for the best possible hit: 4/5 cycles for L1, 12 cycles for L2, etc. So every time you miss you get an extra uop. There are other cases where you get a lot of uops too, e.g., if the 4-cycle fast path fails. – BeeOnRope Sep 27 '18 at 00:07
  • @BeeOnRope: I'd be surprised for L3; the latency depends on ring-bus contention, so it would be hard for the scheduler to predict the exact cycle to expect a result. If it were basing it on a notice of incoming data a cycle before it's actually ready, there wouldn't be false positives. (Or maybe there's a notification even for misses, so perf counters can count L3 hit vs. miss when the L3 miss is detected instead of when the DRAM result arrives?) – Peter Cordes Sep 27 '18 at 00:15
  • Right, I don't think this is related to the effect here (since instructions also vary), but I think you get 3 uops for an L3 hit: the two failed uops for L1 and L2, and then the successful uop for the L3 hit: this doesn't require the L3 uop to actually issue at a deterministic time though, maybe at this point it just uses some other wakeup mechanism, same as the miss-to-DRAM case. So I meant the uop count increases up to and including L2-misses L3-hits, not necessarily that the final uop L3 hits are scheduled optimistically (but I think yes for L1 and L2). – BeeOnRope Sep 27 '18 at 00:18
  • It would be awesome to see page fault graphs that you omitted for brevity since I think those are among the most important graphs! You can't deduce them from first principles easily due to things like fault-around. – BeeOnRope Sep 27 '18 at 02:23
  • @BeeOnRope I'm going to post another question that includes those graphs because I have a different question regarding page walks for the same code. I felt like the two questions should be posted separately. I'll try to post the other question as soon as I can. – Hadi Brais Sep 27 '18 at 02:25
  • Cool. Basically I want to know if there is one page fault per accessed page. It is reasonable to expect less due to "fault around" but I don't know that this works for the `bss` section. – BeeOnRope Sep 27 '18 at 02:26
  • @BeeOnRope There are multiple page walks per page and the number varies in a pattern that I don't understand (which is the question). What is "fault around"? – Hadi Brais Sep 27 '18 at 02:29
  • On Linux, when a page-fault occurs, the OS may update the page table for additional "nearby" pages (on my system 15 extra pages) if they are resident. This means that page-faults are reduced by 16x on my system since each fault actually adds 16 pages. This works for file-backed pages, but perhaps not for bss which is special (implicitly maps the zero page or something like that). – BeeOnRope Sep 27 '18 at 02:32
  • @PeterCordes and Hadi - about your first two comments: even if page walks _that hit_ are implemented in hardware with no micro-coded ops, here we are talking about page walks _that miss_, right? I wouldn't be at all surprised if that path had a ton of micro-code on it. – BeeOnRope Sep 27 '18 at 03:03
  • About terminology "in hardware" to me doesn't mean "no microcode" it just means "not in software". I.e., x86 has always had a hardware page walker because the OS doesn't need to get involved. Peter is saying "_dedicated_ hardware" to mean that it doesn't use micro-code and generally doesn't interfere with the core (I think). However, as I said immediately above (and @Peter could confirm) that claim is all about page walks that _find the requested page_ somewhere in the page tables. I.e., that do not cause a page-fault. OTOH, your test is causing a page-fault every time, right? – BeeOnRope Sep 27 '18 at 03:30
  • So then it is less about the _page-walking_ behavior, which is probably mostly invisible to the performance counters (but it does show up as a "miss", so you get a couple of extra load uops as I explain in my answer), and more about the _page-faulting_ behavior, which is an entirely different beast and absolutely can't be implemented in "dedicated hardware", since it is mostly implemented in software in the OS PF handler. As I mentioned in my answer, you can also test the page-walk-_hit_ case by mapping pages appropriately. – BeeOnRope Sep 27 '18 at 03:32
  • @BeeOnRope That makes sense. Raising the actual page fault requires a microcode assist. In that case, the results suggest that it takes 506 uops to raise a page fault. – Hadi Brais Sep 27 '18 at 03:41
  • Yes, that's why I was saying that the `page-fault` graphs would be very interesting: I expect they follow the exact pattern as the other guys, ramping from 0 to 1 as offset goes from 0 to 4096, then flat. So all the other graphs are just a linear combination of that graph (multiplied by however many ops each pagefault adds to a counter), plus the "baseline" graph which is the ops purely from the loop iteration (equivalent to the values at `OFFSET` 0). – BeeOnRope Sep 27 '18 at 03:44
  • About the replay @PeterCordes and Hadi - I'm not 100% confident in the replay stuff. I was determining it based on the p2 and p3 PMUs. However, I noticed that in a throughput scenario (i.e., loads not feeding loads) there is no increase in p2 or p3 counts for L2 hits, L3 hits, etc - just 1 uop per load total. In the scenario where I do see it (`./uarch-bench.sh --timer=libpfc --test-name=memory/load-serial/* --extra-events UOPS_DISPATCHED.PORT_2,UOPS_DISPATCHED.PORT_3`), a pointer chasing test, it goes away if I add 2 (but not 1) ALU ops in the path from load to load. – BeeOnRope Sep 27 '18 at 05:50
  • So I can't tell if the effect is "real", i.e., an indication of a replay, or if the counters are just double or triple counting or something else. In particular, the performance doesn't really change between the two "modes". So I'm not sure what to think there... – BeeOnRope Sep 27 '18 at 05:51
  • @BeeOnRope, misses are not supposed to be counted as extra uops. It doesn't make any sense - the OOO machine has one entry point and must be that way to preserve ordering. Hit/miss indications are much further down the pipe, so they can't allocate additional uops. – Leeor Sep 28 '18 at 06:04
  • @Leeor - extra uops can't be _allocated_ (and indeed I don't see extra uops showing up in the various counters that track them in the front end), but the same allocated uop could be _replayed_ from within the scheduler, right? That is, if the scheduler wakes up a uop at the wrong time and dispatches it but some input wasn't ready, then it has to try again, and the "dispatched" counters seem to count this - or else the counters are broken somehow. – BeeOnRope Sep 28 '18 at 15:24
  • Even if that's true, it doesn't mean it is necessarily the explanation here - we already know there is a lot of extra activity coming from each page fault, so that can presumably explain any extra uops without resorting to extra uops from misses as an explanation (and I tend to agree that replayed uops seem unlikely to show up in the "retired" counters that Hadi is referring to here - I haven't checked that as they don't exit IIRC on Skylake). – BeeOnRope Sep 28 '18 at 15:27
  • @PeterCordes and Hadi - one more update about the replay stuff - after more checking, I found out what was going on: it is the _dependent_ ops that are usually replayed, which is why inserting some ALU ops stopped me from seeing it (since I wasn't looking at `p0156` uops). So basically when a load feeds into a load, only a load will be replayed since it's the only dependent op. If you have ALU ops after, the ALU ops will be replayed. Sometimes more than one uop is replayed, including not-directly-dependent ones; it seems uops that would execute within one cycle of the load are replayed. – BeeOnRope Oct 01 '18 at 00:49

2 Answers


The effect you see repeatedly across many of the performance counters, where the value increases linearly until stride 4096 after which it stays constant, makes total sense if you assume the effect is purely due to increasing page faults with increasing stride. Page faults affect the observed values because many counters are not exact in the presence of interrupts, page-faults and so on.

For example, take the instructions counter which ramps from 4 to 5 as you progress from stride 0 to 4096. We know from other sources that each page fault on Haswell will count one extra instruction in user mode (and one extra in kernel mode as well).

So the number of instructions we expect is the base of 4 instructions in the loop, plus some fraction of an instruction based on how many page faults we take per loop. If we assume each new 4 KiB page causes a page fault, then the number of page faults per iteration is:

MIN(OFFSET / 4096, 1)

Since each page fault counts an extra instruction, we have then for the expected instruction count:

4 + 1 * MIN(OFFSET / 4096, 1)

which is in perfect agreement with your graph.
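The model can be written out directly (a sketch of the formula above, with the function name my own invention):

```python
def expected_instructions_per_iteration(offset):
    """4 loop instructions, plus one over-counted instruction per page
    fault, with min(offset/4096, 1) page faults per iteration."""
    return 4 + 1 * min(offset / 4096, 1)
```

At offset 0 this gives 4; it ramps linearly to 5 at offset 4096 and stays flat afterwards, which is the shape of the instructions curve in the question's second graph.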

So then the rough shape of the sloped graphs is explained for all the counters at once, with the slope depending only on the amount of over-counting per page fault. Then the only remaining question is why a page fault affects each counter in the way you determined. We've covered instructions already, but let's take a peek at the other ones:

MEM_LOAD_UOPS.L1_MISS

You get only 1 miss per page because only the load that touches the next page misses anything (it takes a fault). I don't actually agree that it is the L1 prefetcher that results in no other misses: I think you'd get the same result if you turned off the prefetchers. I think you get no more L1 misses since the same physical page backs every virtual page, and once you've added the TLB entry all lines are already in L1 (the very first iteration will miss - but I guess you are doing many iterations).

MEM_UOPS_RETIRED.ALL_LOADS

This shows 3 uops (2 extra) per page-fault.

I'm not 100% sure how this event works in the presence of uop replay. Does it always count a fixed number of uops based on the instruction, e.g., the number you'd see in Agner's instruction -> uop tables? Or does it count the actual number of uops dispatched on behalf of the instruction? This is usually the same, but loads replay their uops when they miss at various cache levels.

For example, I have found that on Haswell and Skylake2 when a load misses in L1 but hits in L2, you see 2 uops total between the load ports (port2 and port3). Presumably what happens is that the uop is dispatched with the assumption it will hit in L1, and when this doesn't happen (the result is not ready when the scheduler expected it), it gets replayed with new timing anticipating an L2 hit. This is "lightweight" in that it doesn't require any kind of pipeline clear as no wrong-path instructions have been executed.

Similarly for an L3 miss I have observed 3 uops per load.

Given that, it seems reasonable to assume the miss on the new page causes the load uop to be replayed twice (as I have observed), and those uops show up in the MEM_UOPS_RETIRED counter. One may reasonably argue that the replayed uops are not retired, but in some sense retirement is more associated with instructions than uops. Maybe this counter would be better described as "dispatched uops associated with retired load instructions".

UOPS_RETIRED.ALL and IDQ.MS_UOPS

The remaining weirdness is the large number of uops associated with each page. It seems entirely possible that this is associated with the page-fault machinery. You could try a similar test that misses in the TLB, but doesn't take the page-fault (make sure the pages are already populated, e.g., using mmap with MAP_POPULATE).

The difference between MS_UOPS and UOPS_RETIRED doesn't seem that odd since some uops may not be retired. Maybe they also count in different domains (I forget if UOPS_RETIRED is fused or unfused domain).

Maybe there is also leakage between user and kernel mode counts in this case.

Cycles versus uop derivative

In the last part of your question, you show that the "slope" of cycles versus offset is about 2.6x larger than the slope of retired uops versus offset.

As above, the effect here stops at 4096 and we expect again this effect is entirely due to page-faults. So the difference in slope just means that a page fault costs 2.6x more cycles than it does uops.

You say:

If interrupts and page faults were indeed the (only) cause of perturbation, shouldn't both rates be very close?

I don't see why. The relationship between uops and cycles can vary widely, by perhaps three orders of magnitude: the CPU might execute four uops per cycle, or it might take 100s of cycles to execute a single uop (such as a cache-missing load).

The value of 2.6 cycles per uop is right in the middle of this big range and doesn't strike me as odd: it is a bit high ("inefficient" if you were talking about optimized application code) but here we are talking about page fault handling which is a totally different thing, so we expect long delays.

Studies into over-counting

Anyone interested in over-counting due to page-faults and other events might be interested in this github repository which has exhaustive tests for "determinism" of various PMU events, and where many results of this nature have been noted, including on Haswell. It doesn't, however, cover all the counters Hadi mentions here (otherwise we'd already have our answer). Here's the associated paper and some easier-to-consume associated slides - they mention in particular that one extra instruction is incurred per page fault.

Here's a quote for the results from Intel:

Conclusions on the event determinism:
1.  BR_INST_RETIRED.ALL (0x04C4)
a.  Near branch (no code segment change): Vince tested 
    BR_INST_RETIRED.CONDITIONAL and concluded it as deterministic. 
    We verified that this applies to the near branch event by using 
    BR_INST_RETIRED.ALL - BR_INST_RETIRED.FAR_BRANCHES.
b.  Far branch (with code segment change): BR_INST_RETIRED.FAR_BRANCHES 
    counts interrupts and page-faults. In particular, for all ring 
    (OS and user) levels the event counts 2 for each interrupt or 
    page-fault, which occurs on interrupt/fault entry and exit (IRET).
    For Ring 3 (user) level,  the counter counts 1 for the interrupt/fault
    exit. Subtracting the interrupts and faults (PerfMon event 0x01cb and
    Linux Perf event - faults), BR_INST_RETIRED.FAR_BRANCHES remains a 
    constant of 2 for all the 17 tests by Perf (the 2 count appears coming
    from the Linux Perf for counter enabling and disabling). 
Consequently, BR_INST_RETIRED.FAR_BRANCHES is deterministic. 

So you expect one extra instruction (in particular, a branch instruction), per page-fault.


1 In many cases this "inexactness" is still deterministic - in that the over- or under-counting always behaves in the same way in the presence of the external event, so you may be able to correct for it if you also track how many of the relevant events have happened.

2 I don't mean to limit it to those two micro-architectures: they just happen to be the ones I've tested.

BeeOnRope
  • I'm familiar with Weaver's great work. Table 6 mentions that the instruction count can be perturbed by interrupts and page faults. Table 7 seems to suggest that the number of retired uops on Haswell is pretty deterministic. Section 3.1.2 mentions that microcode uops might also be counted towards retired uops. My experiments show that the number of microcode uops per page is constant for all strides but the number of retired uops per page only becomes constant at stride 4096. I've edited my question. Peter said page walks don't require microcode uops, but I feel this is not precise. – Hadi Brais Sep 27 '18 at 01:25
  • Good point about the L1 prefetcher. But shouldn't we get only one miss then or few misses perhaps (that is, no correlation with the stride)? – Hadi Brais Sep 27 '18 at 03:36
  • @HadiBrais - your tests reflect that there are a large number of micro-coded uops, and uops in general, associated with every page-fault, which isn't surprising. The number of these is constant per-page (which means constantly increasing with offset until 4096). The number of retired uops per page is obviously decreasing with stride since smaller offsets mean many more iterations per page. Am I missing something? I think the stride thing is perhaps leading to confusion: all the graphs look easily explained by X work per iteration and Y work per page-fault. – BeeOnRope Sep 27 '18 at 03:38
  • @HadiBrais - of course L1 misses are "correlated with the stride", because the stride is linearly correlated with the number of page-faults and the misses come from the TLB miss or page fault. Again I think the whole stride thing is being confusing: if you plotted everything "per page" after subtracting out the "expected values" (X in my last comment) from the actual iteration, everything would be flat. The extra uops aren't coming from the extra "stride"; they are coming from all the page-faults, which are proportional to stride due to the design of the test. – BeeOnRope Sep 27 '18 at 03:45
  • I plotted the graphs for uops retired per page (minus the expected number) and instructions retired per page (minus the expected number). Both never become flat and at some point are negative. When I don't subtract the expected number, the number of uops per page only becomes flat after 4096. The number of instructions per page never becomes flat. – Hadi Brais Sep 27 '18 at 04:10
  • I can confirm that the number of minor page faults per page is 1 for all strides and the number of major page faults is zero for all strides. – Hadi Brais Sep 27 '18 at 04:24
  • @HadiBrais - when you say "per page" how are you calculating the number of pages? Do you account for the fact that some pages are skipped once offset > 4096? All the graphs in your question currently make sense to me at least in terms of their structure; do they make sense to you? If there are other graphs that don't maybe post them. In case it wasn't clear: all the graphs go flat or change behavior at 4096 since that's when increasing offset no longer leads to more page-faults per iteration, since at that point we are already at the "max" of 1 page fault every iteration. – BeeOnRope Sep 27 '18 at 04:32
  • Yes I did that. The difference (instructions per page - (4*iterations per page)) gets flat at stride 512 (the value becomes about 1) . The difference (uops per page - (3*iterations per page)) goes from -878 to 263 for strides 32-8192. It never becomes flat. – Hadi Brais Sep 27 '18 at 04:45
  • @HadiBrais - maybe you should show your work or update the question because I'm not seeing it. For example, the graph you included above shows 3 uops per iteration at offset 0, increasing from there, so I'm not seeing how `(uops per page - (3*iterations per page))` can ever be negative unless you are talking about different numbers. – BeeOnRope Sep 27 '18 at 04:54
  • I can send you the Excel spreadsheet that contains all the raw results to your email if you want. Although it's a bit messy. You can check everything. I might have made some mistake somewhere, but it doesn't look like it. – Hadi Brais Sep 27 '18 at 04:59
  • I don't really want to dig through your Excel since I may be misunderstanding something. Just focus on my simple claim above. To elaborate: you are saying that `(uops per page - (3*iterations per page))` is negative, which implies `uops per page < 3 * iterations per page` which implies `uops < 3 * iterations` which implies `uops/iteration < 3` - those are just elementary algebraic transformations. Agree so far? However, the graphs you published in your question show that `uops/iteration >= 3` everywhere, a direct contradiction. So either I'm not getting something, or in the comments ... – BeeOnRope Sep 27 '18 at 05:05
  • ... you are talking about different results than the graphs in the question. – BeeOnRope Sep 27 '18 at 05:05
  • It's only negative for four strides: 32, 64, 96, and 128. It's hard to see that in the graphs. Then it's always positive and increasing. – Hadi Brais Sep 27 '18 at 05:11
  • Weird, because I can see very clearly from [this graph](https://i.stack.imgur.com/Jh3OD.png) and [this one](https://i.stack.imgur.com/L1VRO.png) that the orange line definitely never drops below 3 at those values. Are you talking about a different data set? – BeeOnRope Sep 27 '18 at 05:16
  • That graph shows the number of uops per iteration, not per page. These two metrics have very different values. The number of uops per iteration is 2.98 at stride 0 and larger than 3 for all other strides. I'm using the same data set for everything in the question and in my spreadsheets. – Hadi Brais Sep 27 '18 at 05:23
  • Have you tried testing with multiple uops waiting for a load result? Do they all get woken up in anticipation of the load putting a result on the forwarding network? – Peter Cordes Sep 27 '18 at 05:25
  • @HadiBrais - yes, but I showed you my simple algebra above. `X per page < Y per page` implies `X < Y` by the simple rule of multiplying both sides by "page" (page has to be non-negative for this to work, obviously it is). That's why I asked you how you are calculating "per page" (it's just a transformation of the data showed in the graphs I linked, not a new measurement, right?) - although it shouldn't matter anyways since "per page" cancels out in that inequality. – BeeOnRope Sep 27 '18 at 05:26
  • Finally I found an error in my spreadsheet. I was calculating `(uops per page - (3*instructions per page))` instead of `(uops per page - (3*iterations per page))`. Now the uop count is flat at 274 for all strides :) . Now considering `(instructions per page - (4*iterations per page))`: it becomes flat relatively quickly at stride 512. At stride 32 it's 0.26 and then it increases until it reaches 1 at stride 512 and later. – Hadi Brais Sep 27 '18 at 05:40
  • It seems that out of the 506 microcode assist uops per page, 274 are being counted as retired uops. – Hadi Brais Sep 27 '18 at 05:44
  • What happens if you count in kernel mode? – BeeOnRope Sep 27 '18 at 05:52
  • Calculating (loads per page - (1*iterations per page)), it's flat at 2 for all strides. The number of L1 misses per page is itself flat at one per page (without subtraction). If I subtract, it's no longer flat. This means that it's not perturbed. – Hadi Brais Sep 27 '18 at 05:56
  • Well the right number of L1 misses to subtract is zero, since you don't expect any misses per iteration (the same physical page is being re-used so the L1 cache is hot the entire test). So the subtraction case should be the same as the no subtraction case for that metric. – BeeOnRope Sep 27 '18 at 06:03
  • I forgot that we are using the same physical page AGAIN. Regarding `(L1 load hits per page - (1*iterations per page))`, it's flat at 0, and `(uop slots per page - (3*iterations per page))` is flat at 270. One perf event is left, `RESOURCE_STALLS.ANY`, which is not trivial. It had better be flat. – Hadi Brais Sep 27 '18 at 06:14
  • Thanks. I've posted an answer with more details. – Hadi Brais Sep 27 '18 at 19:10

I think that @BeeOnRope's answer fully answers my question. I would like to add some additional details here based on @BeeOnRope's answer and the comments under it. In particular, I'll show how to determine whether a performance event occurs a fixed number of times per iteration for all load strides or not.

It's easy to see by looking at the code that it takes 3 uops to execute a single iteration. The first few loads might miss in the L1 cache, but then all later loads will hit in the cache because all virtual pages are mapped to the same physical page and the L1 in Intel processors is physically tagged and indexed. So 3 uops. Now consider the UOPS_RETIRED.ALL performance event, which occurs when a uop retires. We expect to see about 3 * number of iterations such events. Hardware interrupts and page faults that occur during execution require a microcode assist to handle, which will probably perturb the performance events. Therefore, for a specific measurement of a performance event X, the source of each counted event can be:

  • The instructions of the code being profiled. Let's call this X1.
  • Uops used to raise a page fault that occurred due to a memory access attempted by the code being profiled. Let's call this X2.
  • Uops used to call an interrupt handler due to an asynchronous hardware interrupt or to raise a software exception. Let's call this X3.

Hence, X = X1 + X2 + X3.

Since the code is simple, we were able to determine through static analysis that X1 = 3. But we don't know anything about X2 and X3, which may not be constant per iteration. We can measure X though using UOPS_RETIRED.ALL.

Fortunately, for our code, the number of page faults follows a regular pattern: exactly one per page accessed (which can be verified using perf). It's reasonable to assume that the same amount of work is required to raise every page fault, and so it will have the same impact on X every time. Note that this is in contrast to the number of page faults per iteration, which is different for different load strides. The number of uops retired as a direct result of executing the loop per page accessed is constant.

Our code does not raise any software exceptions, so we don't have to worry about them. What about hardware interrupts? Well, on Linux, as long as we run the code on a core that is not assigned to handle mouse/keyboard interrupts, the only interrupt that really matters is the local APIC timer. Fortunately, this interrupt occurs regularly as well. As long as the amount of time spent per page is the same, the impact of the timer interrupt on X will be constant per page.

Since the contributions of page faults and timer interrupts are both constant per page accessed, we can lump X2 and X3 into a single per-page term X4 and simplify the previous equation to:

X = X1 + X4.

Thus, for all load strides,

(X per page) - (X1 per page) = (X4 per page) = constant.

Now I'll discuss why this is useful and provide examples using different performance events. We are going to need the following notation:

ec = total number of performance events (measured)
np = total number of virtual memory mappings used = minor page faults + major page faults (measured)
exp = expected number of performance events per iteration *on average* (unknown)
iter = total number of iterations. (statically known)

Note that in general, we don't know the expected per-iteration count of the performance event we are interested in, which is why we need to measure it. The case of retired uops was easy, but in general this is what we need to find out or verify experimentally. Essentially, exp is the per-iteration count of the performance event excluding the contributions from raising page faults and handling interrupts.

Based on the argument and assumptions stated above, we can derive the following equation:

C = (ec/np) - (exp*iter/np) = (ec - exp*iter)/np

There are two unknowns here: the constant C and the value we are interested in, exp. So we need two equations to calculate them. Since this equation holds for all strides, we can use measurements from two different strides:

C = (ec1 - exp*iter)/np1
C = (ec2 - exp*iter)/np2

We can find exp:

(ec1 - exp*iter)/np1 = (ec2 - exp*iter)/np2
ec1*np2 - exp*iter*np2 = ec2*np1 - exp*iter*np1
ec1*np2 - ec2*np1 = exp*iter*np2 - exp*iter*np1
ec1*np2 - ec2*np1 = exp*iter*(np2 - np1)

Thus,

exp = (ec1*np2 - ec2*np1)/(iter*(np2 - np1))

Let's apply this equation to UOPS_RETIRED.ALL.

stride1 = 32
iter = 10 million
np1 = 10 million * 32 / 4096 = 78125
ec1 = 51410801

stride2 = 64
iter = 10 million
np2 = 10 million * 64 / 4096 = 156250
ec2 = 72883662

exp = (51410801*156250 - 72883662*78125)/(10m*(156250 - 78125))
= 2.99

Nice! Very close to the expected 3 retired uops per iteration.

C = (51410801 - 2.99*10m)/78125 = 275.3

I've calculated C for all strides. It's not exactly a constant, but it's 275+-1 for all strides.
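The two-stride solve above is easy to script. Here is a minimal Python sketch (the helper name `solve_exp` is made up) that re-derives exp and C from the UOPS_RETIRED.ALL counts quoted above:

```python
# Solve for exp and C from two (ec, np) measurements taken with the same
# iteration count, using: exp = (ec1*np2 - ec2*np1) / (iter*(np2 - np1))

def solve_exp(ec1, np1, ec2, np2, iters):
    return (ec1 * np2 - ec2 * np1) / (iters * (np2 - np1))

iters = 10_000_000
np1, ec1 = 78_125, 51_410_801    # stride 32: pages touched, UOPS_RETIRED.ALL
np2, ec2 = 156_250, 72_883_662   # stride 64

exp = solve_exp(ec1, np1, ec2, np2, iters)
C = (ec1 - exp * iters) / np1
print(round(exp, 2), round(C, 1))  # exp is about 2.99 uops/iteration, C about 275
```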

exp for other performance events can be derived similarly:

MEM_LOAD_UOPS_RETIRED.L1_MISS: exp = 0
MEM_LOAD_UOPS_RETIRED.L1_HIT: exp = 1
MEM_UOPS_RETIRED.ALL_LOADS: exp = 1
UOPS_RETIRED.RETIRE_SLOTS: exp = 3

So does this work for all performance events? Well, let's try something less obvious. Consider for example RESOURCE_STALLS.ANY, which measures allocator stall cycles for any reason. It's rather hard to tell what exp should be just by looking at the code. Note that for our code, RESOURCE_STALLS.ROB and RESOURCE_STALLS.RS are zero; only RESOURCE_STALLS.ANY is significant here. Armed with the equation for exp and experimental results for different strides, we can calculate exp.

stride1 = 32
iter = 10 million
np1 = 10 million * 32 / 4096 = 78125
ec1 = 9207261

stride2 = 64
iter = 10 million
np2 = 10 million * 64 / 4096 = 156250
ec2 = 16111308

exp = (9207261*156250 - 16111308*78125)/(10m*(156250 - 78125))
= 0.23

C = (9207261 - 0.23*10m)/78125 = 88.4

I've calculated C for all strides. Well, it doesn't look constant. Perhaps we should use different strides? No harm in trying.

stride1 = 32
iter1 = 10 million
np1 = 10 million * 32 / 4096 = 78125
ec1 = 9207261

stride2 = 4096
iter2 = 1 million
np2 = 1 million * 4096 / 4096 = 1m
ec2 = 102563371

Since the iteration counts differ this time, the two equations become:

C = (ec1 - exp*iter1)/np1
C = (ec2 - exp*iter2)/np2

which solve to:

exp = (ec1*np2 - ec2*np1)/(iter1*np2 - iter2*np1)
    = (9207261*1m - 102563371*78125)/(10m*1m - 1m*78125)
    = 0.12

C = (9207261 - 0.12*10m)/78125 = 102.5

(Note that this time I used a different number of iterations just to show that you can do that.)

We got a different value for exp. I've calculated C for all strides and it still does not look constant, as the following graph shows. It varies significantly at smaller strides and then slightly after 2048. This means that the assumption of a fixed number of allocator stall cycles per page does not hold well here. In other words, the standard deviation of the allocator stall cycles across strides is significant.

(Graph: calculated C for RESOURCE_STALLS.ANY across strides.)

For the UOPS_RETIRED.STALL_CYCLES performance event, exp = -0.32 and the standard deviation is also significant. This means the assumption of a fixed number of retired-stall cycles per page does not hold well either.

(Graph: calculated C for UOPS_RETIRED.STALL_CYCLES across strides.)


I've developed an easy way to correct the measured number of retired instructions. Each triggered page fault adds exactly one extra event to the retired instructions counter. For example, assume that a page fault occurs regularly after some fixed number of iterations, say 2; that is, every two iterations, a fault is triggered. This happens for the code in the question when the stride is 2048. Since we expect 4 instructions to retire per iteration, the total number of expected retired instructions until a page fault occurs is 4*2 = 8. Since a page fault adds one extra event to the counter, it will be measured as 9 for the two iterations instead of 8, i.e., 4.5 per iteration. When I actually measure the retired instruction count for the 2048-stride case, it is very close to 4.5.

In all cases, when I apply this method to statically predict the value of the measured retired instructions per iteration, the error is always less than 1%. This is extremely accurate despite hardware interrupts. I think that as long as the total execution time is below 5 billion core cycles, hardware interrupts will not have any significant impact on the retired instructions counter (each of my experiments took no more than 5 billion cycles, which is why). But as explained above, one must always pay attention to the number of faults that occurred.
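A quick sketch of this prediction (the function name is made up; it assumes 4 instructions per iteration, 4 KiB pages, one extra retired-instruction event per fault, and at most one fault per iteration once the stride reaches the page size):

```python
# Predicted measured "instructions retired" per iteration, assuming each
# page fault adds exactly one extra event to the counter.

def predicted_inst_per_iter(stride, inst_per_iter=4, page_size=4096):
    # One fault per 4 KiB page touched; at most one page per iteration.
    faults_per_iter = min(stride / page_size, 1.0)
    return inst_per_iter + faults_per_iter

print(predicted_inst_per_iter(2048))  # 4.5, matching the measured value above
print(predicted_inst_per_iter(4096))  # 5.0, the plateau seen for strides >= 4096
```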

As discussed above, many performance counters can be corrected by calculating the per-page values. On the other hand, the retired instructions counter can be corrected by considering the number of iterations it takes to get a page fault. RESOURCE_STALLS.ANY and UOPS_RETIRED.STALL_CYCLES can perhaps be corrected similarly to the retired instructions counter, but I have not investigated these two.

Hadi Brais