Brendan Gregg, in his recent blog post "CPU Utilization is Wrong", suggests using the instructions per cycle (IPC) PMC. In short, if IPC is < 1.0, then the app can be considered memory bound. Otherwise it can be considered instruction bound. Here is a relevant excerpt from his post:
If your IPC is < 1.0, you are likely memory stalled, and software
tuning strategies include reducing memory I/O, and improving CPU
caching and memory locality, especially on NUMA systems. Hardware
tuning includes using processors with larger CPU caches, and faster
memory, busses, and interconnects.
If your IPC is > 1.0, you are likely instruction bound. Look for ways
to reduce code execution: eliminate unnecessary work, cache
operations, etc. CPU flame graphs are a great tool for this
investigation. For hardware tuning, try a faster clock rate, and more
cores/hyperthreads.
For my above rules, I split on an IPC of 1.0. Where did I get that
from? I made it up, based on my prior work with PMCs. Here's how you
can get a value that's custom for your system and runtime: write two
dummy workloads, one that is CPU bound, and one memory bound. Measure
their IPC, then calculate their mid point.
Here are some examples of dummy workloads generated by the stress tool, and their IPCs.
Memory bound test, IPC is low (0,02):
$ perf stat stress --vm 4 -t 3
stress: info: [4520] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [4520] successful run completed in 3s
Performance counter stats for 'stress --vm 4 -t 3':
10767,074968 task-clock:u (msec) # 3,560 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
4 555 919 page-faults:u # 0,423 M/sec
4 290 929 426 cycles:u # 0,399 GHz
67 779 143 instructions:u # 0,02 insn per cycle
18 074 114 branches:u # 1,679 M/sec
5 398 branch-misses:u # 0,03% of all branches
3,024851934 seconds time elapsed
CPU bound test, IPC is high (1,44):
$ perf stat stress --cpu 4 -t 3
stress: info: [4465] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
stress: info: [4465] successful run completed in 3s
Performance counter stats for 'stress --cpu 4 -t 3':
11419,683671 task-clock:u (msec) # 3,805 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
108 page-faults:u # 0,009 K/sec
30 562 187 954 cycles:u # 2,676 GHz
43 995 290 836 instructions:u # 1,44 insn per cycle
13 043 425 872 branches:u # 1142,188 M/sec
26 312 747 branch-misses:u # 0,20% of all branches
3,001218526 seconds time elapsed
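To get a threshold for this particular machine, as the excerpt suggests, the mid point of the two measured IPCs can be computed from the raw counters above. The following is only a minimal Python sketch: the helper names (ipc, midpoint, classify) are made up for illustration and are not part of perf or stress, and the resulting threshold (roughly 0.73 here) applies only to the system and workloads these runs were taken on.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle from two raw counter values."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def midpoint(low: float, high: float) -> float:
    """Mid point between the memory-bound and CPU-bound IPC."""
    return (low + high) / 2


def classify(value: float, threshold: float) -> str:
    """Rough rule of thumb: below the threshold suggests memory stalls."""
    if value < threshold:
        return "likely memory bound"
    return "likely instruction bound"


# Counter values copied from the two perf stat runs above.
memory_ipc = ipc(instructions=67_779_143, cycles=4_290_929_426)
cpu_ipc = ipc(instructions=43_995_290_836, cycles=30_562_187_954)

# Hypothetical per-system threshold, replacing the made-up 1.0 split.
threshold = midpoint(memory_ipc, cpu_ipc)

print(f"memory bound IPC: {memory_ipc:.2f}")
print(f"CPU bound IPC:    {cpu_ipc:.2f}")
print(f"threshold:        {threshold:.2f}")
```

An IPC measured for a real application on the same machine can then be passed to classify() together with this threshold. Treat the result as a hint for where to look next, not as a verdict.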