
I have been trying to profile some code that I wrote as a small memory test on my machine and by using perf I noticed:

 Performance counter stats for './MemBenchmark':

            15,980      LLC-loads                                                   
             8,714      LLC-load-misses           #   54.53% of all LL-cache hits   

      10.002878281 seconds time elapsed
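
(For reference, output like the above comes from a perf stat invocation along these lines; the exact event list here is just an illustrative example:)

    perf stat -e LLC-loads,LLC-load-misses ./MemBenchmark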

The whole idea of the benchmark is to 'stress' the memory, so in my book the higher I can make the miss rate, the better.

EDIT: Is there functionality within perf that allows a program to be profiled in separate sections? e.g. if main() contains three for loops, is it possible to profile each loop individually to see the number of LLC load misses it causes?

Jason
  • It probably depends upon your hardware – Basile Starynkevitch Apr 25 '18 at 14:48
  • @BasileStarynkevitch What factors would you say depend on my hardware? On my test machine it's got an i3-380M and 6GB RAM – Jason Apr 25 '18 at 14:52
  • It depends on your program's performance with regard to memory access. If your program takes a 100x performance hit when an L1 cache miss occurs, you likely want the hit rate to be very close to 100%, and would go out of your way to keep the caches filled with the right data. If the performance hit is much lower, you can also accept a much lower hit rate. – 9000 Apr 25 '18 at 16:49

1 Answer


Remember that LLC-loads only counts loads that missed in L1d and L2. As a fraction of total loads (L1-dcache-loads), that's probably a very good hit rate for the cache hierarchy overall (thanks to good locality and/or successful prefetch.)
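
For example (assuming the generic perf cache events are supported on your machine), you can count total loads alongside LLC loads in a single run and work out the overall fraction yourself:

    perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./MemBenchmark

If LLC-loads is only a tiny fraction of L1-dcache-loads, your overall hit rate is high even though the LLC-load-misses percentage looks alarming.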

(Your CPU has a 3-level cache, so the Last Level is the shared L3; L1 and L2 are private per-core caches. On CPUs with only 2 levels of cache, the LLC would be L2.)

Only ~9k accesses that had to go all the way to DRAM over 10 seconds is very, very good.

A low LLC hit rate with such a low total LLC-loads count tells you that your workload has good locality for most of its accesses, but the accesses that do make it past L2 often have to go all the way to DRAM, and only about half of them benefit from having an L3 cache at all.

Related: Cache friendly offline random read, and see @BeeOnRope's answer on Understanding perf detail when comparing two different implementations of a BFS algorithm, where he says the absolute number of LLC misses is what counts for performance.

An algorithm with poor locality will generate a lot of L2 misses, and often a lot of L3 hits (quite possibly with a high L3 hit rate), but also many total L3 misses, so the pipeline is stalled a lot of the time waiting for memory.


What metric could you suggest to measure how my program performs in terms of stressing the memory?

Do you want to know how much total memory traffic your program causes, including prefetches? i.e. what kind of impact it might have on other programs competing for memory bandwidth? offcore_requests.all_requests could tell you how many requests (including L2 prefetches, page walks, and both loads and stores, but not L3 prefetches) make it past L2 to the shared L3 cache, whether or not they hit in shared L3. (Use the ocperf.py wrapper for perf. My Skylake has that event; IDK if your Nehalem will.)
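
A sketch of what that could look like with the pmu-tools wrapper (the event name is the one mentioned above; whether it exists at all depends on your microarchitecture):

    ocperf.py stat -e offcore_requests.all_requests ./MemBenchmark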

As far as detecting whether your code bottlenecks on memory, LLC-load-misses per second as an absolute measure would be reasonable. Skylake at least has a cycle_activity.stalls_l3_miss event to count cycles where no uops executed and there was an outstanding L3 miss. If that's more than a couple % of total cycles, you'd want to look into avoiding those stalls.
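
Something along these lines (Skylake-family event name; an older perf or CPU may not recognize it, in which case ocperf.py can help) lets you compare those stall cycles against total cycles:

    perf stat -e cycles,cycle_activity.stalls_l3_miss ./MemBenchmark

Then cycle_activity.stalls_l3_miss divided by cycles is the fraction of time the core executed no uops while an L3 miss was outstanding.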

(I haven't tried using these events to learn anything myself, they might not be the most useful suggestion. It's hard to know the right question to ask yourself when profiling; there are lots of events you could look at but using them to learn something that helps you figure out how to change your code is hard. It helps a lot to have a good mental picture of how your code uses memory, so you know what to look for. For such a general question, it's hard to say much.)

Is there a way you could suggest that can break down the benchmark file to see which loops are causing the most stress?

You can use perf record -e whatever / perf report -Mintel to do statistical sample-based profiling for any event you want, to see where the hotspots are.
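
For example, to see which code accounts for most of the LLC load misses (using the same generic event alias as in your perf stat output, assuming it's sample-able on your CPU):

    perf record -e LLC-load-misses ./MemBenchmark
    perf report -Mintel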

But for cache misses, sometimes the blame lies with some code that looped over an array and evicted lots of valuable data, not the code touching the valuable data that would still be hot.

A loop over a big array might not see many cache misses itself if hardware prefetching does its job.

See linux perf: how to interpret and find hotspots. It can be very useful to use stack sampling if you don't know exactly what's slow and fast in your program. Sampling the call stack on each event will show you which function call high up in the call tree is to blame for all the work its callees are doing. Avoiding that call in the first place can be much better than speeding up the functions it calls by a bit.
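
A minimal sketch of that kind of stack sampling (use --call-graph dwarf instead of -g if your binaries are built without frame pointers):

    perf record -g -e LLC-load-misses ./MemBenchmark
    perf report        # shows callers as well, since the record included call stacks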

(Avoid work instead of just doing the same work with better brute force. Careful application of the maximum brute force a modern CPU can bring to bear (e.g. with AVX2) is useful after you've established that you can't avoid doing the work in the first place.)

Peter Cordes
  • Peter, thanks very much for the detailed answer. I must, however, ask you one last question. I have since run this on a different machine with the same LLC metrics, only now I'm getting less than 15% of all LL-cache hits. Would you consider that to be a lot worse than my previous test? I think I'm confused as to whether lower is better or worse in terms of locality – Jason Apr 26 '18 at 10:26
  • 1
    @Jason: Yes that's a lot lower, but on a metric that isn't important for the performance of your program because the absolute rate is still low. I assume the 2nd machine has a smaller LLC but the same size L1/L2. Your program has great locality for most of its accesses (which never make it to LLC at all), and bad locality for a few more, so when those do make it to LLC, they usually miss. – Peter Cordes Apr 26 '18 at 16:24
  • What metric could you suggest to measure how my program performs in terms of stressing the memory? Is there a way you could suggest that can break down the benchmark file to see which loops are causing the most stress? Thanks! – Jason Apr 26 '18 at 18:44
  • 1
    @Jason: edited my answer with replies to your comment. *Definitely* go read [linux perf: how to interpret and find hotspots](//stackoverflow.com/q/7031210) – Peter Cordes Apr 26 '18 at 20:08