
Is it possible to measure the number of successful store-forwarding operations using the performance counters on recent Intel x86 chips?

I see events such as ld_blocks.store_forward, which measure failed store-forwarding, but it's not clear to me whether the successful case can be measured.

BeeOnRope
  • I can't see any performance counter for that job, but the tightest upper bound on that number I can conceive of would probably be `mem_load_uops_retired.l1_hits - ld_blocks.store_forward`. – Iwillnotexist Idonotexist Sep 10 '17 at 02:46
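
The estimate in the comment above can be written out as a small helper. This is only a sketch: the function name and the sample counts are hypothetical, and clamping at zero is an added assumption for the case where the two counters were sampled over slightly different windows.

```python
def forwarded_loads_upper_bound(l1_hit_loads: int, blocked_forwards: int) -> int:
    """Upper bound on successfully forwarded loads.

    l1_hit_loads     -- count of retired loads that hit L1
    blocked_forwards -- count from ld_blocks.store_forward (failed forwards)

    Every L1-hit load that was not a failed forward *might* have been
    forwarded, so the difference is an upper bound, not an exact count.
    """
    # Clamp at zero (assumption): multiplexed counters can be slightly inconsistent.
    return max(l1_hit_loads - blocked_forwards, 0)


# Hypothetical counts, for illustration only.
print(forwarded_loads_upper_bound(1_000_000, 2_500))
```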

2 Answers


I don't see anything more than you did for SKL, but older uarches may have more details:

For Core2 (what Intel confusingly calls the Core microarchitecture), the optimization manual documents (in B.7 EVENT RATIOS FOR INTEL CORE MICROARCHITECTURE):

B.7.5.2 4K Aliasing and Store Forwarding Block Detection

  1. Loads Blocked by Overlapping Store Rate: LOAD_BLOCK.OVERLAP_STORE/CPU_CLK_UNHALTED.CORE

4K aliasing and store forwarding block are two different scenarios in which loads are blocked by preceding stores due to different reasons. Both scenarios are detected by the same event: LOAD_BLOCK.OVERLAP_STORE. A high value for “Loads Blocked by Overlapping Store Rate” indicates that either 4K aliasing or store forwarding block may affect performance.

This may count stalled and successful store-forwarding. (And 4k aliasing, so you need to avoid that or subtract it.)

B.7.5.3 Load Block by Preceding Stores

  1. Loads Blocked by Unknown Store Address Rate: LOAD_BLOCK.STA / CPU_CLK_UNHALTED.CORE

A high value for “Loads Blocked by Unknown Store Address Rate” indicates that loads are frequently blocked by preceding stores with unknown address and implies performance penalty.

  1. Loads Blocked by Unknown Store Data Rate: LOAD_BLOCK.STD / CPU_CLK_UNHALTED.CORE

A high value for “Loads Blocked by Unknown Store Data Rate” indicates that loads are frequently blocked by preceding stores with unknown data and implies performance penalty.

These last two counters would appear to count successful store forwarding, but only in cases where the load actually had to wait after detecting the (possible) overlap.
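
As a rough sketch, the three ratios quoted above can be computed from raw counts like this. The event names come from the manual excerpt; the function name and the idea of passing the counts in a dict are assumptions made for illustration, and the counts would have to be collected on a Core2 machine with whatever tool exposes these events.

```python
from typing import Dict

# Numerator events from section B.7.5 as quoted above.
LOAD_BLOCK_EVENTS = (
    "LOAD_BLOCK.OVERLAP_STORE",  # store-forwarding block or 4K aliasing
    "LOAD_BLOCK.STA",            # preceding store with unknown address
    "LOAD_BLOCK.STD",            # preceding store with unknown data
)


def load_block_rates(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each LOAD_BLOCK.* count divided by CPU_CLK_UNHALTED.CORE."""
    cycles = counts["CPU_CLK_UNHALTED.CORE"]
    if cycles <= 0:
        raise ValueError("CPU_CLK_UNHALTED.CORE must be positive")
    return {name: counts[name] / cycles for name in LOAD_BLOCK_EVENTS}


# Hypothetical counts, for illustration only.
sample = {
    "CPU_CLK_UNHALTED.CORE": 1_000_000,
    "LOAD_BLOCK.OVERLAP_STORE": 5_000,
    "LOAD_BLOCK.STA": 1_000,
    "LOAD_BLOCK.STD": 250,
}
for event, rate in load_block_rates(sample).items():
    print(f"{event}: {rate:.6f} per cycle")
```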

Peter Cordes

There is no documented event that counts successful store-forwarding operations. However, I have experimentally identified a set of undocumented events that do so on Haswell and Broadwell. In particular, any event with event code 0x2 and an odd umask (such as 1) seems to count successful store forwarding very accurately: the counts match what I expect and the standard deviation is practically zero. I think the same events can be used on later (and even earlier) microarchitectures. Again, none of these events are documented.
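
For anyone who wants to try this with Linux `perf`, the sketch below builds the two usual spellings of a raw event from an event code and a umask. The helper names are hypothetical, and the `cpu` PMU name is an assumption that holds on typical non-hybrid Intel systems. The encoding itself (umask in the upper byte, event select in the lower byte of the `rNNNN` form) is the standard perf raw-event format. Whether the resulting counter really measures store forwarding on a given CPU is exactly the undocumented part described above.

```python
def perf_raw_event(event_select: int, umask: int) -> str:
    """Format an event as perf's rUUEE raw code (UU = umask, EE = event select)."""
    if not (0 <= event_select <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event_select and umask must each fit in one byte")
    return f"r{umask:02x}{event_select:02x}"


def perf_pmu_event(event_select: int, umask: int, pmu: str = "cpu") -> str:
    """Format the same event in perf's pmu/event=..,umask=../ syntax."""
    if not (0 <= event_select <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event_select and umask must each fit in one byte")
    return f"{pmu}/event={event_select:#x},umask={umask:#x}/"


# Event code 0x2 with umask 1, as described in the answer (undocumented).
print(perf_raw_event(0x02, 0x01))
print(perf_pmu_event(0x02, 0x01))
```

Either string can then be passed to `perf stat -e`, for example `perf stat -e r0102 ./your_benchmark`, where `./your_benchmark` is a placeholder for the program under test.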

Hadi Brais
  • Very interesting find. May I ask how you found it? – BeeOnRope Nov 15 '18 at 22:38
  • @BeeOnRope Intel has documented this event (somewhat) for Bonnell and specified a umask of 0x81. Bonnell was released in 2008. As long as the same event code has not been reused, there is a good chance that it is still there in all later microarchitectures, but not documented. John McCalpin explained in one of his posts why Intel may remove some events from the documentation even though they are still implemented, but I can't find the link :/ Anyway, any odd value for umask seems to work. Even values don't seem to count anything. – Hadi Brais Nov 15 '18 at 22:52
  • Bonnell is the last generation of in-order Atom (https://en.wikipedia.org/wiki/Bonnell_(microarchitecture)) before Silvermont. It's not a direct ancestor of Sandybridge-family CPUs like Haswell. Store-forwarding on Atom has *major* differences (1c latency, and only partial overlap doesn't defeat it, e.g. for a dword store / qword reload). If the same counter is implemented on Haswell, it was probably added separately. – Peter Cordes Nov 16 '18 at 01:47
  • I wonder if there's some condition under which it's not reliable; and that's maybe why they chose not to document it. – Peter Cordes Nov 16 '18 at 01:48
  • @PeterCordes I did not extensively test it, so I don't know how reliable it really is. But I think that is as close as we can get. – Hadi Brais Nov 16 '18 at 01:49