I don't see anything more than you did for SKL, but older uarches may have more details:
For Core2 (what Intel confusingly calls the Core microarchitecture), the optimization manual documents (in B.7
EVENT RATIOS FOR INTEL CORE MICROARCHITECTURE):
B.7.5.2 4K Aliasing and Store Forwarding Block Detection
- Loads Blocked by Overlapping Store Rate:
LOAD_BLOCK.OVERLAP_STORE / CPU_CLK_UNHALTED.CORE
4K aliasing and store forwarding block are two different scenarios in which loads are
blocked by preceding stores due to different reasons. Both scenarios
are detected by the same event: LOAD_BLOCK.OVERLAP_STORE. A high value
for “Loads Blocked by Overlapping Store Rate” indicates that either 4K
aliasing or store forwarding block may affect performance.
This event may count both stalled and successful store forwarding. (It also counts 4K aliasing, so you need to avoid that or subtract it.)
B.7.5.3 Load Block by Preceding Stores
- Loads Blocked by Unknown Store Address Rate:
LOAD_BLOCK.STA / CPU_CLK_UNHALTED.CORE
A high value for “Loads Blocked by Unknown Store
Address Rate” indicates that loads are frequently blocked by preceding
stores with unknown address and implies performance penalty.
- Loads Blocked by Unknown Store Data Rate:
LOAD_BLOCK.STD / CPU_CLK_UNHALTED.CORE
A high value for “Loads Blocked by Unknown Store
Data Rate” indicates that loads are frequently blocked by preceding
stores with unknown data and implies performance penalty.
These last two counters would appear to count successful store forwarding, but only in cases where the load actually had to wait after detecting the (possible) overlap.
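For reference, here is a minimal sketch of how the three ratios quoted above could be computed once you have raw counts from your profiler. The counter values are made-up numbers for illustration only, and the helper names `event_ratio` and `load_block_ratios` are hypothetical; only the event names and the "event / CPU_CLK_UNHALTED.CORE" form come from the manual.

```python
def event_ratio(event_count: int, core_cycles: int) -> float:
    """Return event_count / core_cycles, or 0.0 if no cycles were counted."""
    if core_cycles <= 0:
        return 0.0
    return event_count / core_cycles


def load_block_ratios(counts: dict) -> dict:
    """Divide every LOAD_BLOCK.* count by unhalted core cycles."""
    cycles = counts["CPU_CLK_UNHALTED.CORE"]
    return {
        name: event_ratio(value, cycles)
        for name, value in counts.items()
        if name.startswith("LOAD_BLOCK.")
    }


# Hypothetical raw counts, NOT measured on real hardware.
counts = {
    "LOAD_BLOCK.OVERLAP_STORE": 1_200_000,
    "LOAD_BLOCK.STA": 300_000,
    "LOAD_BLOCK.STD": 150_000,
    "CPU_CLK_UNHALTED.CORE": 60_000_000,
}

ratios = load_block_ratios(counts)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f} per core cycle")
```

Keep in mind that the OVERLAP_STORE ratio mixes 4K aliasing with store-forwarding blocks, so a high value alone doesn't tell you which of the two you're seeing.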