
The results below are measured using perf on a compute server with 32 cores. I know my implementation is unoptimized, but that's on purpose, since I want to make comparisons. I also understand that graph algorithms tend to have low locality, which researchers try to address.

I'm unclear about the results, though. The elapsed time is misleading: my implementation traverses a graph with about 4 million nodes in about 10 seconds, and the rest of the time is pre-processing. The optimized version uses the same input and runs the traversal about 10 times, each pass taking less than a second, so its elapsed time is really just pre-processing time. I'm not trying to achieve the same performance; I just want to understand why the difference arises, based on the perf output.

I see my page faults are substantially higher. I'm not 100% sure why this is the case, as the annotations (from what I can tell) do not point to any specific piece of my own code...

`__gnu_cxx::new_allocator<std::_List_node<int> >::construct<int, int const&>`

This seems to happen when I build the graph itself, since I create linked lists for the adjacency lists. I figured this might actually cause issues and wanted to investigate anyway. Should I be able to improve page faults (and hopefully performance) by switching to jagged arrays?
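Roughly what I have in mind for the jagged arrays is a CSR-style layout, sketched below (the `CsrGraph` / `build_csr` names are just illustrative, and I haven't benchmarked this yet):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

// Illustrative sketch only: pack per-node std::list adjacency lists into a
// flat "jagged array" (CSR-style) layout so each node's neighbours sit
// contiguously in memory.
struct CsrGraph {
    std::vector<std::int64_t> offsets;  // node v's neighbours live at [offsets[v], offsets[v+1])
    std::vector<int> neighbours;        // all adjacency lists packed back to back
};

CsrGraph build_csr(const std::vector<std::list<int>>& adj) {
    CsrGraph g;
    g.offsets.assign(adj.size() + 1, 0);

    // First pass: prefix sum of the degrees gives each node's slice boundaries.
    for (std::size_t v = 0; v < adj.size(); ++v)
        g.offsets[v + 1] = g.offsets[v] + static_cast<std::int64_t>(adj[v].size());

    // Second pass: copy every neighbour into one contiguous buffer.
    g.neighbours.resize(static_cast<std::size_t>(g.offsets.back()));
    for (std::size_t v = 0; v < adj.size(); ++v) {
        std::size_t out = static_cast<std::size_t>(g.offsets[v]);
        for (int w : adj[v])
            g.neighbours[out++] = w;
    }
    return g;
}
```

The BFS inner loop would then scan `neighbours[offsets[v]]` .. `neighbours[offsets[v+1] - 1]` sequentially instead of chasing `std::list` node pointers, which should give the hardware prefetcher something to work with.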

The optimized algorithm has a much higher last-level cache miss count, which I thought would explain the primary issue with BFS / graph algorithms (low locality), but its performance seems unaffected by this, and my unoptimized version's count is significantly lower.

Then there are the frontend / backend stall cycles, which seem to point the opposite way when comparing the two: I'm worse in the frontend, and the optimized version is worse in the backend.

Am I missing or misunderstanding something obvious? I thought the low locality would show up as something obvious in the perf output, but the optimized version's numbers confuse me.


This is my implementation of unoptimized parallel BFS (running once)...

[perf stat output for my BFS implementation]

This is using an optimized parallel BFS from a benchmark suite (running 10 times)...

[perf stat output for the optimized BFS from the benchmark suite]

Both take about 40 seconds to pre-process the data once, before doing parallel searching.

pad11
  • Your branch-miss percentage is only half, and you have fewer total branches (so only about 1/4 of the total number of branch mispredicts). Your version was only slightly slower (54.4s vs. 52.5s) with less CPU usage (task-clock) over fewer average CPUs. Also fewer L3 cache misses. (But more L3 cache accesses, so more L2 misses.) – Peter Cordes Apr 18 '18 at 20:02
  • The optimized version is running the BFS pass 10 times while I'm just doing it once. The elapsed time is misleading: we both take about 40 seconds to pre-process, but the remaining time is spent on the BFS passes, and mine runs just once. The optimized version does each pass in less than a second. – pad11 Apr 18 '18 at 20:58
  • Ah, I see what you mean. Many more hits to last-level cache. Then that's likely the reason? Would it be because of using linked lists for the adjacency lists? – pad11 Apr 18 '18 at 21:22
  • If the way you allocate them doesn't make them sequential in memory, HW prefetch will do a much worse job and could explain more L1/L2 cache misses. Walking a linked list makes the load-use latency into a loop-carried dependency chain, and limits memory parallelism. Also, make sure you're using hugepages for any big chunks of memory; that might explain the much lower pagefaults. – Peter Cordes Apr 18 '18 at 21:29
  • I'd suggest `perf record` for some interesting events like L2 misses or LLC-loads, and use `perf report` to see if you can figure out which loads are causing the cache misses. Or which loads are high-latency using the `mem_trans_retired.load_latency_gt_32` counter, or something. (`ocperf.py` is a nice wrapper for `perf` giving symbolic event names for more microarch-specific events on Intel CPUs.) BTW, have a look at [this answer](https://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343), and think about the high level before you get bogged down in the low level. – Peter Cordes Apr 18 '18 at 21:31
  • Also related: [linux perf: how to interpret and find hotspots](//stackoverflow.com/q/7031210) – Peter Cordes Apr 18 '18 at 21:34
  • Consider doing the pre-processing separately, so you can profile *just* the search part. Profiling the pre-processing *and* the searching together gives you a weaker signal. Either use `perf` libraries to only start profiling part way into your program, or dump the preprocessed data to a file which you can `mmap(MAP_POPULATE)` (or read into a buffer that uses 2M-aligned HUGETLB pages). Anyway, split the parallel search part into a separate process so you can `perf stat` or `perf record` just it. – Peter Cordes Apr 18 '18 at 21:40 (a sketch of the mmap approach appears after these comments)
  • You need to cut the pre-processing part out of your profiling since you'll never be able to compare them easily without that. You can use the `--delay msecs` option on `perf` to do that. Let's say the pre-processing takes 10 seconds, use `--delay 11000` and put a two second `sleep` after pre-processing, and then `perf` should always start during the sleep and just record the actual processing. The sleep itself uses no CPU and so doesn't affect the results. Consider running both the same number of times since there are also "first run" effects that will throw you off. – BeeOnRope Apr 18 '18 at 23:05
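Following up on the comments above about splitting the pre-processing from the search: below is a minimal sketch of the `mmap(MAP_POPULATE)` idea, assuming the pre-processed adjacency data has already been dumped to a flat binary file (the function name and file layout here are hypothetical, just to show the mechanism; Linux-specific):

```cpp
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical example: map a pre-processed graph file into memory with
// MAP_POPULATE so the pages are faulted in up front, before the search
// phase that we actually want to profile with perf.
// (Alternatively, the data could be read into a 2M-aligned MAP_HUGETLB
// buffer, as suggested in the comments above.)
const int* map_preprocessed(const char* path, std::size_t* out_bytes) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return nullptr; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return nullptr; }

    // MAP_POPULATE pre-faults the whole mapping, so the page-fault cost is
    // paid here during setup rather than inside the BFS passes.
    void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                   PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
    close(fd);  // the mapping stays valid after closing the descriptor
    if (p == MAP_FAILED) { perror("mmap"); return nullptr; }

    *out_bytes = static_cast<std::size_t>(st.st_size);
    return static_cast<const int*>(p);
}
```

With the pre-processing done in a separate step like this, `perf stat` / `perf record` can be pointed at just the search phase (or started late with `--delay`, as suggested above).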

1 Answer


Unfortunately, perf stat often doesn't give enough information to really determine where the bottleneck in your application is. It is possible for two applications with wildly different underlying bottlenecks to have very similar perf stat profiles. For example, two applications may have the same number or fraction of L2 cache misses, yet one might be dominated by this effect while the other is almost entirely unaffected, depending on the amount and nature of overlapping work.

So if you try to analyze in depth from these high-level counters, you are often just taking stabs in the dark. Still, we can make a few observations. You mention:

The optimized algorithm has a much higher last-level cache miss count, which I thought would explain the primary issue with BFS / graph algorithms (low locality), but its performance seems unaffected by this, and my unoptimized version's count is significantly lower.

First, LLC misses are ~620 million for the optimized algorithm and ~380 million for yours, but the optimized algorithm is run 10 times in this benchmark while yours runs only once. So the optimized algorithm has perhaps 62 million misses per search, and your algorithm has roughly six times as many. Yes, your algorithm has a lower LLC miss *rate*, but the absolute number of LLC misses is what counts for performance. The lower miss rate just means you are making far more total accesses than even the 6x figure suggests: basically, you make many, many more memory accesses than the optimized version, which leads to a higher hit rate but more total misses.

All of this points to your unoptimized algorithm accessing more total memory, or perhaps accessing it in a much more cache-unfriendly fashion. That would also explain the much higher number of page faults. Overall, both algorithms have low IPC, and yours is particularly low (0.49 IPC). Given that there aren't branch-prediction problems, and that you've already identified these as graph algorithms with locality/memory-access problems, stalls while waiting for memory are very likely the bottleneck.

Luckily, there is a better way than just trying to reverse engineer the bottleneck from perf stat output. Intel has developed a whole methodology that tries to do this type of top-down analysis in a way that determines the true bottlenecks. It's not perfect, but it's far and away better than looking at the plain perf stat counters. VTune isn't free, but you can get a similar analysis based on the same methodology using Andi Kleen's toplev. I highly recommend you start there.

BeeOnRope
  • Right, I wasn't considering the fact that the optimized version is running multiple times, so my miss rates look much worse in comparison. Thanks! I was hoping to use VTune and started on it, but unfortunately I'm unable to use it. I may play around with it locally this summer, though. Thanks again! – pad11 Apr 23 '18 at 17:10
  • @pad11 - you really don't need VTune specifically - `perf record` and `perf report` combined with `toplev.py` do much the same thing. – BeeOnRope Apr 23 '18 at 19:41
  • ah ok - good to know. it's a little late now but I'll have a look out of interest later. thank you! – pad11 Apr 25 '18 at 04:47