14

The total time spent by a function in an application can be broadly divided into two components:

  1. Time spent on actual computation (Tcomp)
  2. Time spent on memory accesses (Tmem)

Profilers typically provide an estimate of the total time spent in a function. Is it possible to break that estimate down into the two components above (Tcomp and Tmem)?

Imran
  • 583
  • 3
  • 21
  • There's a very simple way to answer this question. Just do [*random pausing*](https://stackoverflow.com/a/378024/23771) and see what fraction of samples are in memory management. [*Here's an example.*](https://stackoverflow.com/a/927773/23771). As a generality, in large software written by recent graduates, the fraction of time spent in memory management tends to be 50-99%. The good news is a speedup factor of 2-100 only awaits refactoring. – Mike Dunlavey May 31 '17 at 18:14
  • Mike, hello, this is not about memory management (malloc/free/new/delete), it is about effective usage of the CPU. Random pausing doesn't help me find which part of the code is memory-latency limited and which is not limited by memory and runs at full ALU speed. How can I get a point in the **roofline model** (https://en.wikipedia.org/wiki/Roofline_model - https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/) for some task, e.g.: it uses 30 GFLOPS and 15 GB/s of requests to the memory hierarchy, 5 GB/s of which are served by main RAM; RAM utilization is 15%, ALU utilization is 30%, and it uses AVX2 – osgx May 31 '17 at 18:22
  • You might want to have a look at [likwid](https://github.com/RRZE-HPC/likwid). – Henri Menke Jun 05 '17 at 10:44

4 Answers

7

The Roofline model introduces the notion of Arithmetic Intensity: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/. Put simply, it is the number of arithmetic instructions executed per memory access.

Arithmetic Intensity is usually computed using hardware performance counters.
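
As a sketch of how that calculation looks: the event names in the comment below are Intel (Skylake-era) examples and vary by microarchitecture, so check `perf list` on your machine; the counter values here are made up purely for illustration.

```shell
# On a real run you would take the counts from something like:
#   perf stat -e fp_arith_inst_retired.scalar_double \
#             -e mem_inst_retired.all_loads \
#             -e mem_inst_retired.all_stores ./myapp
flops=30000000000   # retired double-precision FLOPs (illustrative)
loads=2000000000    # retired load instructions (illustrative)
stores=500000000    # retired store instructions (illustrative)
# Assuming 8-byte (double) accesses:
ai=$(awk -v f="$flops" -v l="$loads" -v s="$stores" \
     'BEGIN { printf "%.2f", f / ((l + s) * 8) }')
echo "arithmetic intensity: $ai FLOP/byte"
```

An intensity well below the machine's FLOP/byte balance point puts the task in the memory-bound region of the roofline; well above it, in the compute-bound region.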

Manuel Selva
  • 16,987
  • 21
  • 76
  • 127
  • Thanks Manuel. This seems to be closer to what I am trying to understand and achieve. I will give it a detailed look. Based on more reading, I am trying to get a quantitative estimate of whether an application is memory- or compute-bound. – Imran Nov 15 '16 at 14:18
  • 1
    The roofline model was designed for exactly that. I strongly recommend reading the paper *Applying the Roofline Model* for practical details: http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/ispass-2013_177.pdf – Manuel Selva Nov 15 '16 at 14:26
  • Manuel, can you expand most useful parts from the paper and integrate them into your answer? – osgx May 30 '17 at 05:28
7

Brendan Gregg, in his recent blog post CPU Utilization is Wrong, suggests using the instructions-per-cycle (IPC) PMC. In short, if IPC is < 1.0 the app can be considered memory bound; otherwise it can be considered instruction bound. Here is a relevant excerpt from his post:

If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.

If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.

For my above rules, I split on an IPC of 1.0. Where did I get that from? I made it up, based on my prior work with PMCs. Here's how you can get a value that's custom for your system and runtime: write two dummy workloads, one that is CPU bound, and one memory bound. Measure their IPC, then calculate their mid point.

Here are some examples of dummy workloads generated with the `stress` tool, together with their IPCs.

Memory-bound test, IPC is low (0.02):

$ perf stat stress --vm 4 -t 3
stress: info: [4520] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [4520] successful run completed in 3s

 Performance counter stats for 'stress --vm 4 -t 3':

      10767,074968      task-clock:u (msec)       #    3,560 CPUs utilized          
                 0      context-switches:u        #    0,000 K/sec                  
                 0      cpu-migrations:u          #    0,000 K/sec                  
         4 555 919      page-faults:u             #    0,423 M/sec                  
     4 290 929 426      cycles:u                  #    0,399 GHz                    
        67 779 143      instructions:u            #    0,02  insn per cycle         
        18 074 114      branches:u                #    1,679 M/sec                  
             5 398      branch-misses:u           #    0,03% of all branches        

       3,024851934 seconds time elapsed

CPU-bound test, IPC is high (1.44):

$ perf stat stress --cpu 4 -t 3
stress: info: [4465] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
stress: info: [4465] successful run completed in 3s

 Performance counter stats for 'stress --cpu 4 -t 3':

      11419,683671      task-clock:u (msec)       #    3,805 CPUs utilized          
                 0      context-switches:u        #    0,000 K/sec                  
                 0      cpu-migrations:u          #    0,000 K/sec                  
               108      page-faults:u             #    0,009 K/sec                  
    30 562 187 954      cycles:u                  #    2,676 GHz                    
    43 995 290 836      instructions:u            #    1,44  insn per cycle         
    13 043 425 872      branches:u                # 1142,188 M/sec                  
        26 312 747      branch-misses:u           #    0,20% of all branches        

       3,001218526 seconds time elapsed
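
Following the quoted advice, a system-specific threshold is simply the midpoint of the two measured IPCs; with the numbers from the two runs above:

```shell
# Midpoint of the memory-bound (0.02) and CPU-bound (1.44) IPCs measured above:
threshold=$(awk 'BEGIN { printf "%.2f", (0.02 + 1.44) / 2 }')
echo "IPC threshold: $threshold"
```

Workloads measuring well below this value on the same machine lean memory bound; well above it, instruction bound.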
ks1322
  • 29,461
  • 12
  • 91
  • 140
  • Thanks! Can you add a command or two showing how to get IPC on Linux (probably using `perf stat`, or with VTune)? Is there any variant of the commands that will do periodic printing of IPC, and/or a system-wide average of IPC (I can't find a periodic interval in the man page of perf http://man7.org/linux/man-pages/man1/perf-stat.1.html to print it every 1 or 5 [seconds like `vmstat`, `iostat` and `sar -n DEV` do](https://medium.com/netflix-techblog/linux-performance-analysis-in-60-000-milliseconds-accc10403c55); what is tiptop)? Any practical example of a memory-bound / latency-bound program and its IPC? – osgx Jun 03 '17 at 22:18
  • Why is an IPC of 1 the border between memory- and CPU-bound? I think IPC is counted by perf in x86 instructions, but many of us know that in a group of 3 or 4 instructions decoded per clock cycle, the first may generate many micro-operations (uops) and the others up to 1 each; and the wide execution engine of the CPU may execute up to 6 or 8 uops per cycle... Hmm, periodic printing is pmcarch - https://github.com/brendangregg/pmc-cloud-tools/blob/master/pmcarch (Linux on Intel only); tiptop is http://tiptop.gforge.inria.fr – osgx Jun 03 '17 at 22:21
  • 1
    I think that big matrix multiplication will be memory-bound because a lot of data must be read from memory. Calculating pi would be CPU bound since there is no need to read much data from memory. IPC of 1 was chosen after measuring 2 dummy workloads. It can be another value for your system, this depends on measurements. – ks1322 Jun 04 '17 at 17:22
  • ks1322, DGEMM (matrix multiplication) is BLAS3 and requires little memory bandwidth: there are O(N^2) memory accesses and O(N^3) CPU operations (for larger matrix sizes). BLAS3 is marked in https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/ on the Arithmetic Intensity scale as the Dense Linear Algebra group, ~tens of FPU ops per memory byte; also check http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/ispass-2013_177.pdf – osgx Jun 04 '17 at 22:03
6

It is not possible to measure this cleanly (and it does not make much sense to try), because computation overlaps with memory accesses on current processor architectures. Also, a memory access is usually broken down into several steps (issuing the access, prefetching into the various cache levels, and the actual read into processor registers).

You can measure cache hits and misses on various cache levels to estimate the efficiency of your algorithm on your hardware using perf and its hardware counters (if supported by your hardware).
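
For example (a sketch: the generic event names in the comment below come from `perf list` and are not supported on every CPU, and the counts here are made up for illustration), a miss ratio can be derived from the two counters:

```shell
# On a real run you would take the counts from something like:
#   perf stat -e cache-references,cache-misses ./myapp
refs=120000000    # cache-references (illustrative)
misses=9000000    # cache-misses (illustrative)
ratio=$(awk -v r="$refs" -v m="$misses" \
        'BEGIN { printf "%.1f%%", 100 * m / r }')
echo "cache miss ratio: $ratio"
```

Repeating this per cache level (L1, LLC) shows where an algorithm's accesses stop being served cheaply.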

Jakuje
  • 20,643
  • 11
  • 53
  • 62
  • It may not make sense when you are optimizing the performance of your application on the architecture you are running on. You are right that cache misses/hits are helpful there, and a lot of tools provide this information. But, IMHO, it makes sense when you need an architecture-independent profile of the application. This can be useful when you are estimating the performance of an application on an emerging architecture. This way you can separately quantify the effect of improving computation and memory access. In other words, it will indicate what should be the focus of improvement on the new architecture. – Imran Nov 15 '16 at 10:03
  • 2
    For simplicity, I can consider memory-access time as the time that is not spent in actual computation. – Imran Nov 15 '16 at 10:03
-1

If you are looking for a way to measure the CPU cycles spent in a function, then Boost will be very helpful. I have used the Boost Timer utility to measure the CPU cycles of a system call.

On the other hand, you can wrap the same timer around the complete program to get the overall time.

I hope this is what you are looking for. -Vijay

vijay sharma
  • 111
  • 11
  • vijay, I'm not looking for timing functions in seconds (in both C and C++ applications), but for methods to find Arithmetic Intensity: how many ALU instructions are executed compared to the number of memory-request instructions. Some tasks are memory-bandwidth or memory-latency bound, others are ALU bound. A single kind of measurement (real-time measurement with Boost Timer) will not help. Hardware performance counters (PMCs) should be used, and the question is which PMCs to use on Intel/AMD CPUs. – osgx May 31 '17 at 16:51
  • @osgx: I commented on the OP question. – Mike Dunlavey May 31 '17 at 18:17
  • Mike, you did not comment on the OP question (it was about **memory access time**, not about the time share of **memory management**); you keep telling other people (4 million of them) what to do / what to learn. Why haven't you received the https://stackoverflow.com/help/badges/262/publicist badge for always linking to your own one-and-only true solution of random stack sampling for any problem that has the word performance in its description? – osgx May 31 '17 at 18:25