169

What other programs do the same thing as gprof?

Good Person
neuromancer

7 Answers

200

gprof (read the paper) exists for historical reasons. If you think it will help you find performance problems, bear in mind that it was never advertised as doing so. Here's what the paper says:

The profile can be used to compare and assess the costs of various implementations.

It does not say it can be used to identify the various implementations to be assessed, although it does imply that it could, under special circumstances:

especially if small portions of the program are found to dominate its execution time.

What about problems that are not so localized? Do those not matter? Don't place expectations on gprof that were never claimed for it. It is only a measurement tool, and only of CPU-bound operations.

Try this instead.
Here's an example of a 44x speedup.
Here's a 730x speedup.
Here's an 8-minute video demonstration.
Here's an explanation of the statistics.
Here's an answer to critiques.

There's a simple observation about programs. In a given execution, every instruction is responsible for some fraction of the overall time (especially call instructions), in the sense that if it were not there, the time would not be spent. During that time, the instruction is on the stack **. When that is understood, you can see that -

gprof embodies certain myths about performance, such as:

  1. that program counter sampling is useful.
    It is only useful if you have an unnecessary hotspot bottleneck such as a bubble sort of a big array of scalar values. As soon as you, for example, change it into a sort using string-compare, it is still a bottleneck, but program counter sampling will not see it because now the hotspot is in string-compare. On the other hand if it were to sample the extended program counter (the call stack), the point at which the string-compare is called, the sort loop, is clearly displayed. In fact, gprof was an attempt to remedy the limitations of pc-only sampling.

  2. that timing functions is more important than capturing time-consuming lines of code.
    The reason for that myth is that gprof was not able to capture stack samples, so instead it times functions, counts their invocations, and tries to capture the call graph. However, once a costly function is identified, you still need to look inside it for the lines that are responsible for the time. If there were stack samples, you would not need to look; those lines would be on the samples. (A typical function might have 100-1000 instructions. A function call is 1 instruction, so something that locates costly calls is 2-3 orders of magnitude more precise.)

  3. that the call graph is important.
    What you need to know about a program is not where it spends its time, but why. When it is spending time in a function, every line of code on the stack gives one link in the chain of reasoning of why it is there. If you can only see part of the stack, you can only see part of the reason why, so you can't tell for sure if that time is actually necessary. What does the call graph tell you? Each arc tells you that some function A was in the process of calling some function B for some fraction of the time. Even if A has only one such line of code calling B, that line only gives a small part of the reason why. If you are lucky enough, maybe that line has a poor reason. Usually, you need to see multiple simultaneous lines to find a poor reason if it is there. If A calls B in more than one place, then it tells you even less.

  4. that recursion is a tricky confusing issue.
    That is only because gprof and other profilers perceive a need to generate a call-graph and then attribute times to the nodes. If one has samples of the stack, the time-cost of each line of code that appears on samples is a very simple number - the fraction of samples it is on. If there is recursion, then a given line can appear more than once on a sample. No matter. Suppose samples are taken every N ms, and the line appears on F% of them (singly or not). If that line can be made to take no time (such as by deleting it or branching around it), then those samples would disappear, and the time would be reduced by F%.

  5. that accuracy of time measurement (and therefore a large number of samples) is important.
    Think about it for a second. If a line of code is on 3 samples out of 5, then if you could shoot it out like a light bulb, that is roughly 60% less time that would be used. Now, you know that if you had taken a different 5 samples, you might have seen it only 2 times, or as many as 4. So that 60% measurement is more like a general range from 40% to 80%. If it were only 40%, would you say the problem is not worth fixing? So what's the point of time accuracy, when what you really want is to find the problems? 500 or 5000 samples would have measured the problem with greater precision, but would not have found it any more accurately. (A rough sketch of taking such stack samples by hand appears just after this list.)

  6. that counting of statement or function invocations is useful.
    Suppose you know a function has been called 1000 times. Can you tell from that what fraction of time it costs? You also need to know how long it takes to run, on average, multiply it by the count, and divide by the total time. The average invocation time could vary from nanoseconds to seconds, so the count alone doesn't tell much. If there are stack samples, the cost of a routine or of any statement is just the fraction of samples it is on. That fraction of time is what could in principle be saved overall if the routine or statement could be made to take no time, so that is what has the most direct relationship to performance.

  7. that samples need not be taken when blocked
    The reasons for this myth are twofold: 1) that PC sampling is meaningless when the program is waiting, and 2) the preoccupation with accuracy of timing. However, for (1) the program may very well be waiting for something that it asked for, such as file I/O, which you need to know, and which stack samples reveal. (Obviously you want to exclude samples while waiting for user input.) For (2) if the program is waiting simply because of competition with other processes, that presumably happens in a fairly random way while it's running. So while the program may be taking longer, that will not have a large effect on the statistic that matters, the percentage of time that statements are on the stack.

  8. that "self time" matters
    Self time only makes sense if you are measuring at the function level, not line level, and you think you need help in discerning if the function time goes into purely local computation versus in called routines. If summarizing at the line level, a line represents self time if it is at the end of the stack, otherwise it represents inclusive time. Either way, what it costs is the percentage of stack samples it is on, so that locates it for you in either case.

  9. that samples have to be taken at high frequency
    This comes from the idea that a performance problem may be fast-acting, and that samples have to be frequent in order to hit it. But if the problem is costing, say, 20% out of a total running time of 10 sec (or whatever), then each sample in that total time will have a 20% chance of hitting it, whether the problem occurs in a single piece like this
    .....XXXXXXXX...........................
    .^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^ (20 samples, 4 hits)
    or in many small pieces like this
    X...X...X.X..X.........X.....X....X.....
    .^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^ (20 samples, 3 hits)
    Either way, the number of hits will average about 1 in 5, no matter how many samples are taken, or how few. (Average = 20 * 0.2 = 4. Standard deviation = +/- sqrt(20 * 0.2 * 0.8) = 1.8.)

  10. that you are trying to find the bottleneck
    as if there were only one. Consider the following execution timeline: vxvWvzvWvxvWvYvWvxvWv.vWvxvWvYvW
    It consists of real useful work, represented by `.`. There are performance problems vWxYz taking 1/2, 1/4, 1/8, 1/16, 1/32 of the time, respectively. Sampling finds v easily. It is removed, leaving
    xWzWxWYWxW.WxWYW
    Now the program takes half as long to run, and now W takes half the time, and is found easily. It is removed, leaving
    xzxYx.xY
    This process continues, each time removing the biggest, by percentage, performance problem, until nothing to remove can be found. Now the only thing executed is ., which executes in 1/32 of the time used by the original program. This is the magnification effect, by which removing any problem makes the remainder larger, by percent, because the denominator is reduced.
    Another crucial point is that every single problem must be found - missing none of the 5. Any problem not found and fixed severely reduces the final speedup ratio. Just finding some, but not all, is not "good enough".
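
To make points 4, 5 and 6 concrete, here is a minimal sketch of taking a handful of wall-clock stack samples by hand with gdb. The process name my_test is only a placeholder; any debugger or pstack-like tool that prints the full call stack can serve the same purpose.

# grab 5 stack samples of a running process, a couple of seconds apart
# (my_test is a placeholder process name)
for i in 1 2 3 4 5; do
    gdb -batch -ex "thread apply all bt" -p `pidof my_test` 2>/dev/null
    sleep 2
done > samples.txt

Then read samples.txt: any line of code that appears on several of the 5 samples is responsible for roughly that fraction of wall-clock time (3 of 5 suggests somewhere around 60%, give or take), and the rest of each sample tells you why that time is being spent.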

ADDED: I would just like to point out one reason why gprof is popular - it is being taught, presumably because it's free, easy to teach, and it's been around a long time. A quick Google search locates some academic institutions that teach it (or appear to):

berkeley bu clemson colorado duke earlham fsu indiana mit msu ncsa.illinois ncsu nyu ou princeton psu stanford ucsd umd umich utah utexas utk wustl

** With the exception of other ways of requesting work to be done, that don't leave a trace telling why, such as by message posting.

Mike Dunlavey
  • Mike, have you thought about modifying valgrind to offer your stack profiles? I'd love to try this out in class. – Norman Ramsey Dec 19 '09 at 05:27
  • 3
    @Norman: I made a profiler based on this, in C for DOS, around '93. I called it yet-another-performance-analyzer, and demo'ed it around at IEEE meetings, but that's as far as it went. There is a product from RotateRight called Zoom that's not too far off. On *nix, __pstack__ is good for doing it manually. My to-do list for work (pharmacometrics on Windows) is about a mile long which precludes fun projects, not to mention family. This might be useful: http://stackoverflow.com/questions/1777669/how-to-modify-a-c-program-so-that-gprof-can-profile-it/1931042#1931042 – Mike Dunlavey Dec 19 '09 at 13:16
  • 6
    I've always found profilers not so useful for fixing slow code, and instead used selective bits of debugging code to measure the time taken by a group of statements of my choosing, often aided by some trivial little macros or whatever. It's never taken me too long to find the culprit, but I've always been embarrassed by my "bear skins and stone knives" approach when "everyone else" (as far as I know) uses the fancy tools. Thank you for showing me why I could never get the information I needed from a profiler. This is one of the most important ideas I've seen on SO. Well done! – Wayne Conrad Feb 11 '10 at 02:44
  • @Wayne-Conrad: Thx. Actually there are profilers that sample the call stack, on wall-clock time, suppressing sampling during user-wait, and that summarize by % at the line / instruction level, such as RotateRight/Zoom. Oprofile can be made to do that too. Personally, I just rely on the pause button, because sometimes even the call stack is insufficient state information to say **why** something is happening. – Mike Dunlavey Feb 14 '10 at 20:46
  • Hey, please, don't rip gprof. Gprof is one of the most portable profilers and it works on platforms where there is almost (or totally) no other way of profiling programs. – osgx Mar 24 '10 at 22:08
  • 7
    @osgx: I don't mean to rip anything. It's like an old favorite automobile, simple & rugged, but there are things it doesn't do, and we need to be aware of those, and not only that, we need to wake up from the myths. I appreciate that on some platforms it may be difficult to get stack samples, but if a problem is such that gprof will not find it, the fact that it is the only tool is small comfort. – Mike Dunlavey Mar 25 '10 at 00:30
  • Can you expand on point 3 a bit? – Andrew Grimm Jul 28 '10 at 03:51
  • @Andrew: Sure. Forget the difficulty of putting times on a graph, especially with recursion (point 4). Forget that call graphs are function (not line) based (point 2). Forget that call graphs are about time, not location (point 2). The need is to locate the lines of code that call for time to be spent for not-really-necessary reasons. So if a stack sample shows a series of lines of code, the reason for that nanosecond being spent can be "read off" the sample. Then if the reason is one that isn't really needed ... – Mike Dunlavey Jul 28 '10 at 12:09
  • 2
    @Andrew: ... *and* if that reason applies to some significant fraction of samples (like more than 1), then the line(s) of code that could eliminate that activity are on those samples. A graph can give you a *hint* of this, but a not-large number of stack samples will simply show them to you. – Mike Dunlavey Jul 28 '10 at 12:12
  • @Andrew: I'm sorry to belabor, but this effect is unbelievably pronounced in the big C# app I work in, with around 10^6 lines of code. No-one has ever made the graph, it would look like a rat's nest. But the call stack tends to be really deep (like 30) so when we want to speed things up, the hunting is really good in all those calls. – Mike Dunlavey Jul 28 '10 at 12:45
  • @Andrew: If you want to get academic about it, the argument runs two ways. If there is a problem of a certain fractional size F, then out of N stackshots, the number n that expose it is a simple binomial distribution. In the other direction, if n/N stackshots show a problem, then its size F follows a beta distribution. – Mike Dunlavey Jul 28 '10 at 18:18
  • This answer is well and good, but it's not a gprof alternative. Can you suggest something that *does* do stack sampling (and doesn't cost anything)? – Matt Joiner Sep 14 '10 at 11:58
  • @Matt: Well, pausing to get stackshots under a debugger doesn't cost anything, and it finds the problems, and gprof doesn't, so to me at least, that's an alternative. Then again, there's Zoom and LTProf (for money). – Mike Dunlavey Sep 14 '10 at 13:21
  • @Mike Dunlavey: I find that breaking under a debugger still won't pause while the process is executing in the kernel, for instance waiting for IO to complete. That's the only thing gprof _doesn't_ do. Breaking and observing manually is just the same as gprof without the instrumentation. – Matt Joiner Sep 14 '10 at 16:06
  • @Matt: The debuggers I use (VC and VStudio) will break during IO and show me the stack down to the IO call, which is exactly what I need. Not sure about GDB, but even if it doesn't break until the IO completes, that's good enough because then the stack shows you the IO call in which the pause occurred, and the stack leading to it. If you get many of those, that tells you you're spending a good percent of time in that IO, and you can see from the stack if you need to. The difference from gprof (besides not capturing the stack) is (like in the visual studio profiler) ... – Mike Dunlavey Sep 14 '10 at 16:14
  • 1
    @Matt: ... that if you hit that key during a blocking call, you *find out* about it. You find out that you hit it during the blocking call, and you find out why it was there. In gprof and the VS profiler, you don't find out, because any sample during blocking is discarded, so it is totally blind to IO, and at some point during optimization, IO may well be the dominant performance problem. – Mike Dunlavey Sep 14 '10 at 16:18
  • 2
    @Matt: Examples of IO performance problems found this way: 1) printing log messages to a file or the console, which was erroneously thought to be insignificant. 2) Converting between text and doubles in numeric IO. 3) Subterranean IO extracting internationalized strings during startup, strings that it turns out did not need to be internationalized. I've hit lots of examples like these. – Mike Dunlavey Sep 14 '10 at 16:38
  • 2
    @Matt: Example: consider a program that loops 10,000 times. Inside the loop, it crunches numbers for 1ms (total 10s), and then does blocking IO for 9ms (total 90s), so 90% of its time is blocked. gprof and VS profiler will ignore the 90% and act as if the 10% was all there was to tell you about. Stackshots will show you that 90% of the time it is blocked, and why, and that 10% of the time it was crunching, and why. Of course, if you want it to go much faster, which is the part you need to work on? – Mike Dunlavey Sep 14 '10 at 18:25
  • @Mike Dunlavey: I agree with you in principle, I'm just not sure all debuggers are up to it. When using MSVC I do exactly the same thing. Your separation of IO (kernel) and actual CPU profiling are correct (I'm writing a filesystem atm, so naturally this is all very close to home and familiar). OProfile and Zoom are reasonable, but not brilliant. Callgrind is vaguely useful too (but has too much overhead), although it works very similarly to the debugger model you've given. – Matt Joiner Sep 14 '10 at 22:59
  • @Matt: It may be that not all debuggers are up to it, especially if the code you're trying to optimize is inside the kernel. Then you may need bigger guns to get the stackshots. For me, that meant using an in-circuit-emulator box. Also, maybe a software emulator can do it like Valgrind, I don't know. It may not be easy, but there usually is a way to do it. Not to brag too much, but so many times I've just had to go in and do it, while people are standing around guessing. – Mike Dunlavey Sep 15 '10 at 00:59
  • Good optimization skills are pretty rare. Many programmers never do it, don't need to do it, don't know how to do it or do it too early. – Matt Joiner Sep 15 '10 at 02:35
  • @Matt: Rare, yes, sadly, because it's so easy. And you don't need to do it until you do, and then you're stuck. SO is filled with questions and answers about how to guess at performance problems, when finding out is so simple. – Mike Dunlavey Sep 15 '10 at 12:37
  • @Mike Dunlavey: I think because there's a real lack of quality benchmarking tools. I'm not one to _rely_ on tools Windows-style (gotta catch 'em all), but a few good strong utilities.. gcc, gdb, make, gedit... gprof. No corresponding tool of choice exists when it comes to profilers. – Matt Joiner Sep 15 '10 at 12:51
  • @Matt: I'm sure I'm old-fashioned, but tools don't excite me unless they do. For timing, a stopwatch is all I've really needed, though it's nice to have time printed out. For profiling, MS IDEs are totally adequate if they have a pause button. I don't like GDB but it also works. There's no performance problem I can't find, quickly, with these. I'm amazed anybody uses gprof, for all the reasons above. – Mike Dunlavey Sep 15 '10 at 13:47
  • 1
    For some arguments against this approach, see [this question by Mike Dunlavey](http://stackoverflow.com/questions/266373/one-could-use-a-profiler-but-why-not-just-halt-the-program) – Andrew Grimm Sep 28 '10 at 23:36
  • @Andrew: It's charitable to call them arguments. Careful thinking is in short supply. – Mike Dunlavey Sep 29 '10 at 15:23
  • @Andrew: I just re-worded point 3, if you're interested. – Mike Dunlavey Nov 01 '10 at 14:54
  • @mike: +1, I think I'm becoming a fan :) Reminds me of old talks with strange friends about how to spare some Z80 cycles. I will hunt your answers :) – neuro Feb 10 '11 at 18:31
  • @neuro: Lots of people know this method. They're sprinkled around SO. I'm just the one making the most noise, because so many people *don't* know it. Many people think if you pay more, then certainly you get more, but not so in this case. I made a profiler about 17 years ago based on this idea (similar to Zoom). It was attractive, but oddly enough for serious work it was not as good as the totally manual method, because it's not about measuring, and there seems to be no substitute for getting right in the code and studying representative states. – Mike Dunlavey Feb 10 '11 at 20:41
  • @Mike: Yes, you are right. Even if I did not know this manual method, I already think that gprof is not the tool, as it misses so many things (besides being a pain to use). Cachegrind is no better, as with really big complex software it is not up to the task (crashes, or no useful info). You brightly explain that. I've tried Zoom, which gives me comprehensive results. Well, nothing is perfect, as I work on complex software that has many different paths triggered on very different conditions during time, and it is hard to do your manual method ... even if it is hard to apply any automatic one too :) – neuro Feb 11 '11 at 10:45
  • 1
    @neuro: I've painfully discovered you can't optimize things if the program isn't *doing* them. I get some use cases beautifully optimized, and then along comes a use case I didn't get to, and it's slow. When I'm being a good boy I think in terms of coverage, to make sure I've tested all possible cases, but generally for that I need help. – Mike Dunlavey Feb 11 '11 at 12:10
  • @Mike: yes, so true. When I can I try to keep the tree of possible and test the leaf modules independently ... That's a lot of work and of course you can not test complex use cases involving the whole. Coverage helps, of course... Well, it's why we are paid after all :) Nice to talk with you. – neuro Feb 11 '11 at 12:28
  • 2
    @NormanRamsey: FYI, I just posted on *[sourceforge](https://sourceforge.net/projects/randompausedemo/files/)* an updated version (in C++) of the example speedup case. It has all the series of code versions, and the stack samples. The compounded speedup factor is about 700. – Mike Dunlavey Nov 27 '11 at 22:31
  • @Mike: I found the download pretty confusing. I'm totally sold on the idea of stack sampling, but I sure would like better ways of visualizing the results. – Norman Ramsey Nov 28 '11 at 05:31
  • @Norman: I guess I'm trying to say that visualizing is the wrong thing to do with the results. At the time I collect each stack sample, I'm in the debugger, and I go up the stack to examine the code where it's "at" to understand why it's there. Then when I see the same "why" on >1 sample, I know I've found a nugget. Often no single line of code or function is on enough samples to stand out, so profilers don't see the problem. Note: I don't assert the problem is always easy to fix (like replacing the vector class), only easy to find. – Mike Dunlavey Nov 28 '11 at 15:39
  • @NormanRamsey: Let me see if I can give you a better answer. Some sort of line-of-code-based butterfly view is what I like, but that implies the problem localizes nicely. Often the problem is not best described as a code location, but as some other description. Then looking only at locations will miss the problem (as in these examples). It's not OK to miss a problem, because then that stalls the speedup progression. – Mike Dunlavey Nov 29 '11 at 14:30
76

Valgrind has an instruction-count profiler with a very nice visualizer called KCacheGrind. As Mike Dunlavey recommends, Valgrind counts the fraction of instructions for which a procedure is live on the stack, although I'm sorry to say it appears to become confused in the presence of mutual recursion. But the visualizer is very nice and light years ahead of gprof.
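
A minimal usage sketch (the data that KCacheGrind visualizes comes from Valgrind's callgrind tool; the binary name and the pid suffix of the output file are placeholders):

valgrind --tool=callgrind ./my_test
kcachegrind callgrind.out.<pid>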

Peter Mortensen
Norman Ramsey
  • 2
    @Norman: ++ That confusion about recursion seems endemic to systems that have the concept of propagating times among nodes in a graph. Also I think wall-clock time is generally more useful than CPU instruction times, and code lines (call instructions) are more useful than procedures. If stack samples at random wall clock times are taken, then the fractional cost of a line (or procedure, or any other description you can make) is simply estimated by the fraction of samples that exhibit it. – Mike Dunlavey Dec 20 '09 at 13:50
  • 1
    ... I'm emphasizing call instructions, but it applies to any instructions. If one has an honest-to-goodness hotspot bottleneck, such as a bubble sort of a large array of numbers, then the compare/jump/swap/increment instructions of the inner loop will be at the top/bottom of nearly every stack sample. But (especially as software gets big and hardly any routine has much "self" time) many problems actually are call instructions, requesting work that, when it is clear how much it costs, doesn't *really* have to be done. – Mike Dunlavey Dec 20 '09 at 14:07
  • 3
    ... Check this out. I think they are nearly on the right track: http://www.rotateright.com/zoom.html – Mike Dunlavey Dec 20 '09 at 14:18
65

Since I didn't see anything here about perf, which is a relatively new tool for profiling the kernel and user applications on Linux, I decided to add this information.

First of all - this is a tutorial about Linux profiling with perf

You can use perf if your Linux kernel is newer than 2.6.32, or OProfile if it is older. Neither program requires you to instrument your program (as gprof does). However, in order to get the call graph correctly in perf, you need to build your program with -fno-omit-frame-pointer. For example: g++ -fno-omit-frame-pointer -O2 main.cpp.

You can see "live" analysis of your application with perf top:

sudo perf top -p `pidof a.out` -K

Or you can record performance data of a running application and analyze it afterwards:

1) To record performance data:

perf record -p `pidof a.out`

or to record for 10 secs:

perf record -p `pidof a.out` sleep 10

or to record with call-graph information:

perf record -g -p `pidof a.out` 

2) To analyze the recorded data

perf report --stdio
perf report --stdio --sort=dso -g none
perf report --stdio -g none
perf report --stdio -g

Or you can record performance data of an application and analyze it afterwards just by launching the application in this way and waiting for it to exit:

perf record ./a.out

This is an example of profiling a test program

The test program is in file main.cpp (I will put main.cpp at the bottom of the message):

I compile it in this way:

g++ -m64 -fno-omit-frame-pointer -g main.cpp -L.  -ltcmalloc_minimal -o my_test

I use libtcmalloc_minimal.so since it is compiled with -fno-omit-frame-pointer, while libc malloc seems to be compiled without this option. Then I run my test program:

./my_test 100000000 

Then I record performance data of a running process:

perf record -g  -p `pidof my_test` -o ./my_test.perf.data sleep 30

Then I analyze load per module:

perf report --stdio -g none --sort comm,dso -i ./my_test.perf.data

# Overhead  Command                 Shared Object
# ........  .......  ............................
#
    70.06%  my_test  my_test
    28.33%  my_test  libtcmalloc_minimal.so.0.1.0
     1.61%  my_test  [kernel.kallsyms]

Then load per function is analyzed:

perf report --stdio -g none -i ./my_test.perf.data | c++filt

# Overhead  Command                 Shared Object                       Symbol
# ........  .......  ............................  ...........................
#
    29.30%  my_test  my_test                       [.] f2(long)
    29.14%  my_test  my_test                       [.] f1(long)
    15.17%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator new(unsigned long)
    13.16%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator delete(void*)
     9.44%  my_test  my_test                       [.] process_request(long)
     1.01%  my_test  my_test                       [.] operator delete(void*)@plt
     0.97%  my_test  my_test                       [.] operator new(unsigned long)@plt
     0.20%  my_test  my_test                       [.] main
     0.19%  my_test  [kernel.kallsyms]             [k] apic_timer_interrupt
     0.16%  my_test  [kernel.kallsyms]             [k] _spin_lock
     0.13%  my_test  [kernel.kallsyms]             [k] native_write_msr_safe

     and so on ...

Then call chains are analyzed:

perf report --stdio -g graph -i ./my_test.perf.data | c++filt

# Overhead  Command                 Shared Object                       Symbol
# ........  .......  ............................  ...........................
#
    29.30%  my_test  my_test                       [.] f2(long)
            |
            --- f2(long)
               |
                --29.01%-- process_request(long)
                          main
                          __libc_start_main

    29.14%  my_test  my_test                       [.] f1(long)
            |
            --- f1(long)
               |
               |--15.05%-- process_request(long)
               |          main
               |          __libc_start_main
               |
                --13.79%-- f2(long)
                          process_request(long)
                          main
                          __libc_start_main

    15.17%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator new(unsigned long)
            |
            --- operator new(unsigned long)
               |
               |--11.44%-- f1(long)
               |          |
               |          |--5.75%-- process_request(long)
               |          |          main
               |          |          __libc_start_main
               |          |
               |           --5.69%-- f2(long)
               |                     process_request(long)
               |                     main
               |                     __libc_start_main
               |
                --3.01%-- process_request(long)
                          main
                          __libc_start_main

    13.16%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator delete(void*)
            |
            --- operator delete(void*)
               |
               |--9.13%-- f1(long)
               |          |
               |          |--4.63%-- f2(long)
               |          |          process_request(long)
               |          |          main
               |          |          __libc_start_main
               |          |
               |           --4.51%-- process_request(long)
               |                     main
               |                     __libc_start_main
               |
               |--3.05%-- process_request(long)
               |          main
               |          __libc_start_main
               |
                --0.80%-- f2(long)
                          process_request(long)
                          main
                          __libc_start_main

     9.44%  my_test  my_test                       [.] process_request(long)
            |
            --- process_request(long)
               |
                --9.39%-- main
                          __libc_start_main

     1.01%  my_test  my_test                       [.] operator delete(void*)@plt
            |
            --- operator delete(void*)@plt

     0.97%  my_test  my_test                       [.] operator new(unsigned long)@plt
            |
            --- operator new(unsigned long)@plt

     0.20%  my_test  my_test                       [.] main
     0.19%  my_test  [kernel.kallsyms]             [k] apic_timer_interrupt
     0.16%  my_test  [kernel.kallsyms]             [k] _spin_lock
     and so on ...

So at this point you know where your program spends time.

And this is main.cpp for the test:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

time_t f1(time_t time_value)
{
  for (int j =0; j < 10; ++j) {
    ++time_value;
    if (j%5 == 0) {
      double *p = new double;
      delete p;
    }
  }
  return time_value;
}

time_t f2(time_t time_value)
{
  for (int j =0; j < 40; ++j) {
    ++time_value;
  }
  time_value=f1(time_value);
  return time_value;
}

time_t process_request(time_t time_value)
{

  for (int j =0; j < 10; ++j) {
    int *p = new int;
    delete p;
    for (int m =0; m < 10; ++m) {
      ++time_value;
    }
  }
  for (int i =0; i < 10; ++i) {
    time_value=f1(time_value);
    time_value=f2(time_value);
  }
  return time_value;
}

int main(int argc, char* argv2[])
{
  int number_loops = argc > 1 ? atoi(argv2[1]) : 1;
  time_t time_value = time(0);
  printf("number loops %d\n", number_loops);
  printf("time_value: %d\n", time_value );

  for (int i =0; i < number_loops; ++i) {
    time_value = process_request(time_value);
  }
  printf("time_value: %ld\n", time_value );
  return 0;
}
osgx
  • I just ran your example & took 5 stackshots. Here's what they found: 40% (roughly) of the time `f1` was calling `delete`. 40% (roughly) of the time `process_request` was calling `delete`. A good part of the remainder was spent in `new`. The measurements are rough, but the hotspots are pinpointed. – Mike Dunlavey Jan 16 '13 at 01:55
  • What is a `stackshot`? Is it what `pstack` outputs? –  Jan 16 '13 at 14:36
  • As in my answer, you run it under a debugger and hit ^C at a random time and capture the stack trace. Or you can use pstack. You can find things that way that profilers will not find. – Mike Dunlavey Jan 16 '13 at 15:50
  • 2
    `As in my answer, you run it under a debugger and hit ^C at a random time and capture the stack trace`. 1) I think that your technique is not useful when you need to analyze performance problems for a program running on your customer's server. 2) I am not sure how you apply this technique to get information for a program having lots of threads that handle different requests. I mean when the general picture is quite complicated. –  Apr 30 '13 at 10:46
  • 1) It's for finding out what in your own code you can fix to make things faster. On a customer's server, chances are the problem is outside your code - protocol based problems, for example. 2) When the general picture is complicated, you need a way to simplify it, by zeroing in on one request type at a time, for example. The thing to avoid is any sort of holistic approach, looking at the forest rather than the trees. You gotta get down to where "the rubber meets the road". – Mike Dunlavey Apr 30 '13 at 11:25
  • 2
    As for #1. Sometimes customers call and say that your program works slowly. You can't say immediately that `the problem is outside your code`, can you? Since you might need some information in order to support your point. In this situation you at some point might need to profile your application. You can't just ask your customer to start gdb and press ^C and get call stacks. This was my point. This is an example http://spielwiese.fontein.de/2012/01/22/a-problem-with-pthread_once-on-an-out-dated-solaris-installation. I had this problem and profiling helped a lot. –  Apr 30 '13 at 14:31
  • 2
    As for #2. Simplifying is a good approach, I agree. Sometimes it works. If a performance problem occurs only on a customer's server and you can't reproduce it on your server, then profiles are of use. –  Apr 30 '13 at 15:59
  • Sure, if you can't grab samples, anything is better than wondering and guessing, and I'm always careful to say when you have processes that involve message-waiting between cooperating asynchronous agents, stack samples will only take you so far. For situations like that, I've used a logging method that is not fun but gets the job done. There certainly are ways to make the process difficult :) – Mike Dunlavey Apr 30 '13 at 17:54
24

Try OProfile. It is a much better tool for profiling your code. I would also suggest Intel VTune.

Both tools can narrow down the time spent to a particular line of code, annotate your code, show the assembly, and show how long a particular instruction takes. Besides the time metric, you can also query specific counters, e.g. cache hits, etc.

Unlike gprof, you can profile any process/binary running on your system using either of the two.
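
For example, a sketch with OProfile's newer operf front end (assuming OProfile 0.9.8 or later; my_test is just a placeholder binary):

operf ./my_test
opreport --symbols     # time broken down per symbol
opannotate --source    # per-line annotation of the source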

Peter Mortensen
Anycorn
18

Google Performance Tools include a simple-to-use profiler. Both a CPU profiler and a heap profiler are available.
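
A sketch of the CPU profiler from gperftools (the output path and the binary name are placeholders; the profiler library must be installed on your system):

g++ -O2 -g main.cpp -o my_test -lprofiler
CPUPROFILE=/tmp/my_test.prof ./my_test
pprof --text ./my_test /tmp/my_test.prof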

Sandeep
8

Take a look at Sysprof.

Your distribution may have it already.

Søren Sandmann
4

http://lttng.org/ if you want a high-performance tracer
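
A kernel-tracing sketch (assuming the LTTng tools and kernel modules are installed; the session name is arbitrary):

lttng create mysession
lttng enable-event -k sched_switch,sched_wakeup
lttng start
sleep 10          # let the workload run while tracing
lttng stop
lttng view        # pretty-print the recorded trace
lttng destroy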

Guillaume Massé