21

Though I can intuitively get most of the results, I'm having hard time fully understanding the output of the perf report command especially for what concerns the call graph, so I wrote a stupid test to solve this issue of mine once for all.

The stupid test

I compiled what follows with:

gcc -Wall -pedantic -lm perf-test.c -o perf-test

No aggressive optimizations to avoid inlining and such.

#include <math.h>

#define N 10000000UL

#define USELESSNESS(n)                          \
    do {                                        \
        unsigned long i;                        \
        double x = 42;                          \
        for (i = 0; i < (n); i++) x = sin(x);   \
    } while (0)

void baz()
{
    USELESSNESS(N);
}

void bar()
{
    USELESSNESS(2 * N);
    baz();
}

void foo()
{
    USELESSNESS(3 * N);
    bar();
    baz();
}

int main()
{
    foo();
    return 0;
}

Flat profiling

perf record ./perf-test
perf report

With these I get:

  94,44%  perf-test  libm-2.19.so       [.] __sin_sse2
   2,09%  perf-test  perf-test          [.] sin@plt
   1,24%  perf-test  perf-test          [.] foo
   0,85%  perf-test  perf-test          [.] baz
   0,83%  perf-test  perf-test          [.] bar

Which sounds reasonable since the heavy work is actually performed by __sin_sse2 and sin@plt is probably just a wrapper, while the overhead of my functions take into account just the loop, overall: 3*N iterations for foo, 2*N for the other two.

Hierarchical profiling

perf record -g ./perf-test
perf report -G
perf report

Now the overhead columns that I get are two: Children (the output is sorted by this one by default) and Self (the same overhead of the flat profile).

Here is where I start feeling I miss something: regardless of the fact that I use -G or not I'm unable to explain the hierarchy in terms of "x calls y" or "y is called by x", for example:

  • without -G ("y is called by x"):

    -   94,34%    94,06%  perf-test  libm-2.19.so       [.] __sin_sse2
       - __sin_sse2
          + 43,67% foo
          + 41,45% main
          + 14,88% bar
    -   37,73%     0,00%  perf-test  perf-test          [.] main
         main
         __libc_start_main
    -   23,41%     1,35%  perf-test  perf-test          [.] foo
         foo
         main
         __libc_start_main
    -    6,43%     0,83%  perf-test  perf-test          [.] bar
         bar
         foo
         main
         __libc_start_main
    -    0,98%     0,98%  perf-test  perf-test          [.] baz
       - baz
          + 54,71% foo
          + 45,29% bar
    
    1. Why __sin_sse2 is called by main (indirectly?), foo and bar but not by baz?
    2. Why functions sometimes have a percent and a hierarchy attached (e.g., the last instance of baz) and sometimes not (e.g., the last instance of bar)?
  • with -G ("x calls y"):

    -   94,34%    94,06%  perf-test  libm-2.19.so       [.] __sin_sse2
       + __sin_sse2
       + __libc_start_main
       + main
    -   37,73%     0,00%  perf-test  perf-test          [.] main
       - main
          + 62,05% foo
          + 35,73% __sin_sse2
            2,23% sin@plt
    -   23,41%     1,35%  perf-test  perf-test          [.] foo
       - foo
          + 64,40% __sin_sse2
          + 29,18% bar
          + 3,98% sin@plt
            2,44% baz
         __libc_start_main
         main
         foo
    
    1. How should I interpret the first three entries under __sin_sse2?
    2. main calls foo and that's ok, but why if it calls __sin_sse2 and sin@plt (indirectly?) it does not also call bar and baz?
    3. Why do __libc_start_main and main appear under foo? And why foo appears twice?

Suspect is that there are two levels of this hierarchy, in which the second actually represents the "x calls y"/"y is called by x" semantics, but I'm tired to guess so I'm asking here. And the documentation doesn't seem to help.


Sorry for the long post but I hope that all this context may help or act as a reference for someone else too.

cYrus
  • 2,713
  • 5
  • 24
  • 44
  • 1
    I'm not an expert on `perf`, but I do know that by default it looks at the stack ~1000 times per second to collect its data. Therefore fine-grained analysis as you're attempting is likely to fail. So e.g. it's possible that none of the samples occurred when `sin_sse2` was being called by `baz`. Consider using `gprof`, which compiles in stubs to catch every call and return (though it has other problems). – Gene Jan 02 '15 at 16:34
  • Yes, I know but it's fast and it let me record all kinds of crazy events such as cache misses and branch misprediction per symbol, whereas AFAIK `gprof` can't do it. I used `N` quite big precisely to avoid what you mentioned; anyway I tried increasing it even further with no luck, I don't know, but I think that it's quite unlikely that no samples are collected even in a 100M iteration loop. – cYrus Jan 02 '15 at 17:35

1 Answers1

11

Alright, well, let's ignore temporarily the difference between caller and callee call-graphs, mostly because when I compare the results between these two options on my machine, I only see effects inside the kernel.kallsyms DSO for reasons I don't understand -- relatively new to this myself.

I found that for your example, it's a little easier to read the whole tree. So, using --stdio, let's look at the whole tree for __sin_sse2:

# Overhead    Command      Shared Object                  Symbol
# ........  .........  .................  ......................
#
    94.72%  perf-test  libm-2.19.so       [.] __sin_sse2
            |
            --- __sin_sse2
               |
               |--44.20%-- foo
               |          |
               |           --100.00%-- main
               |                     __libc_start_main
               |                     _start
               |                     0x0
               |
               |--27.95%-- baz
               |          |
               |          |--51.78%-- bar
               |          |          foo
               |          |          main
               |          |          __libc_start_main
               |          |          _start
               |          |          0x0
               |          |
               |           --48.22%-- foo
               |                     main
               |                     __libc_start_main
               |                     _start
               |                     0x0
               |
                --27.84%-- bar
                          |
                           --100.00%-- foo
                                     main
                                     __libc_start_main
                                     _start
                                     0x0

So, the way I read this is: 44% of the time, sin is called from foo; 27% of the time it's called from baz, and 27% from bar.

The documentation for -g is instructive:

 -g [type,min[,limit],order[,key]], --call-graph
       Display call chains using type, min percent threshold, optional print limit and order. type can be either:

       ·   flat: single column, linear exposure of call chains.

       ·   graph: use a graph tree, displaying absolute overhead rates.

       ·   fractal: like graph, but displays relative rates. Each branch of the tree is considered as a new profiled object.

               order can be either:
               - callee: callee based call graph.
               - caller: inverted caller based call graph.

               key can be:
               - function: compare on functions
               - address: compare on individual code addresses

               Default: fractal,0.5,callee,function.

The important piece here is that the default is fractal, and in fractal mode, each branch is a new object.

So, you can see that 50% of the time that baz is called, it's called from bar, and the other 50% it's called from foo.

This isn't always the most useful measure, so it's instructive to look at the results using -g graph:

94.72%  perf-test  libm-2.19.so       [.] __sin_sse2
        |
        --- __sin_sse2
           |
           |--41.87%-- foo
           |          |
           |           --41.48%-- main
           |                     __libc_start_main
           |                     _start
           |                     0x0
           |
           |--26.48%-- baz
           |          |
           |          |--13.50%-- bar
           |          |          foo
           |          |          main
           |          |          __libc_start_main
           |          |          _start
           |          |          0x0
           |          |
           |           --12.57%-- foo
           |                     main
           |                     __libc_start_main
           |                     _start
           |                     0x0
           |
            --26.38%-- bar
                      |
                       --26.17%-- foo
                                 main
                                 __libc_start_main
                                 _start
                                 0x0

This changes to using absolute percentages, where each percentage of time is reported for that call chain: So foo->bar is 26% of the total ticks (which in turn calls baz), and foo->baz (direct) is 12% of the total ticks.

I still have no idea why I don't see any differences between callee and caller graphs though, from the perspective of __sin_sse2.

Update

One thing I did change from your command line is how the callgraphs were gathered. Linux perf by default uses the frame pointer method of reconstructing callstacks. This can be a problem when the compiler uses -fomit-frame-pointer as a default. So I used

perf record --call-graph dwarf ./perf-test
Community
  • 1
  • 1
Matthew G.
  • 1,178
  • 9
  • 23
  • The relative vs absolute percentages thing absolutely makes sense. The main point is that I'm just unable to came up with an output which is similar to yours (which seems legit since `__sin_sse2` appears to be called *directly* by all the three functions, while mine features `foo`, `main` and `bar` only). Can you please post the exact flags that you used? – cYrus Feb 04 '15 at 16:38
  • @cYrus take a peek at my update... Will post exact command line for `perf report` after lunch. – Matthew G. Feb 04 '15 at 17:22
  • 1
    @MatthewG. well, it looks like it's all about using `dwarf`... I'll check if that solves all of my listed issues. – cYrus Feb 04 '15 at 17:34
  • @cYrus I copy and pasted your command line for compilation (though, had to move the linker flag to the end). Kernel is 3.13.0-37-generic (which *may* matter). The family of command lines for reports was `perf report -g {graph,fractal},0.05,calle{e,r} --stdio > {graph,fracal}Calle{e,r}` – Matthew G. Feb 04 '15 at 17:46
  • Are you sure the percentages reflect the number of invocations and not the cycles count? – Nathan B Aug 27 '18 at 11:32
  • Yeah, that's looseness in the way I wrote. To be more accurate, perf is a sample based profiler, so it'll be a percentage of samples. Feel free to try to edit the bit that makes you confused the most; I'm honestly not sure the right place to put it. – Matthew G. Aug 29 '18 at 20:48
  • Could you please elaborate on sentence `This changes to using absolute percentages, where each percentage of time is reported for that call chain`. I tried adding up percentages from your table both vertically and "horizontally", but I never get a 100%, so I don't understand what this "absolute percentage" term refers to. – Hi-Angel Jan 27 '19 at 19:18