4

I am using -O3 when compiling the code, and now I need to profile it. For profiling, there are two main choices I came across: valgrind --tool=callgrind and gprof.

Valgrind (callgrind) docs state:

As with Cachegrind, you probably want to compile with debugging info (the -g option) and with optimization turned on.

However, in the C++ optimization book by Agner Fog, I have read the following:

Many optimization options are incompatible with debugging. A debugger can execute a code one line at a time and show the values of all variables. Obviously, this is not possible when parts of the code have been reordered, inlined, or optimized away. It is common to make two versions of a program executable: a debug version with full debugging support which is used during program development, and a release version with all relevant optimization options turned on. Most IDE's (Integrated Development Environments) have facilities for making a debug version and a release version of object files and executables. Make sure to distinguish these two versions and turn off debugging and profiling support in the optimized version of the executable.

This seems to conflict with the callgrind instructions to compile the code with the debugging info flag -g. If I enable debugging in the following way:

-ggdb -DFULLDEBUG

am I not causing those flags to conflict with the -O3 optimization flag? Using those two options together makes no sense to me after what I have read so far.
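For concreteness, this is how I understand the workflow the callgrind docs suggest (the file and binary names are just placeholders I made up):

g++ -O3 -g -o app main.cpp              # optimization on, debug info kept
valgrind --tool=callgrind ./app         # run the optimized binary under callgrind
callgrind_annotate callgrind.out.<pid>  # <pid> stands for the process id in the output file name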

If I use say -O3 optimization flag, can I compile the code with additional profiling info by using:

-pg

and still profile it with valgrind?
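As far as I understand, -pg only adds the instrumentation that gprof needs, roughly like this (again, the names are placeholders):

g++ -O3 -pg -g -o app main.cpp     # -pg inserts gprof's profiling instrumentation
./app                              # on normal exit the run writes gmon.out
gprof ./app gmon.out > report.txt  # gprof combines gmon.out with the executable

whereas valgrind instruments the binary itself at run time, so it should not need -pg at all, but I would like to confirm that.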

Does it ever make sense to profile code compiled with

-ggdb -DFULLDEBUG -O0

flags? It seems silly to me: with functions not inlined and loops not unrolled, the bottlenecks may shift elsewhere in the code, so a build like this should be used for development only, to get the code to actually produce correct results.

Does it ever make sense to compile the code with one optimization flag, and profile the code compiled with another optimization flag?

tmaric
  • Profiling with a different optimization flag would make no sense, because you would be profiling different code. Note that profiling is not the same as debugging. – juanchopanza Feb 14 '14 at 10:15
  • @juanchopanza: thanks, I guessed that as well. But why does callgrind ask for the `-g` flag to be used? Using `-g` allows the debugger to provide information on the stack, which conflicts with the `-O` option, right? – tmaric Feb 14 '14 at 10:18
  • 3
    `-g` doesn't conflict too much with the `-O` flags. There's a nice discussion here: http://stackoverflow.com/questions/89603/how-does-the-debugging-option-g-change-the-binary-executable – juanchopanza Feb 14 '14 at 10:23
  • @juanchopanza: thanks! – tmaric Feb 14 '14 at 10:27

1 Answer

2

Why are you profiling? Just to get measurements or to find speedups?

The common wisdom that you should only profile optimized code rests on the assumption that the code is nearly optimal to begin with; if there are still significant speedups to be found, it is not.

You should treat the finding of speedups as you would the finding of bugs. Many people use random pausing (taking a handful of stack samples by hand in a debugger and reading them) to do exactly that.

After you've removed needless computations, if you still have tight CPU loops, i.e. you're not spending all your time in system or library or I/O routines the optimizer doesn't see, then turn on -O3, and let it do its magic.
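If you want to try manual stack sampling, a minimal sketch with gdb looks like this (the binary name is a placeholder):

g++ -O3 -g -o app main.cpp   # keep -g so stack frames have names even at -O3
gdb ./app
(gdb) run
^C                           # interrupt while the program is busy
(gdb) bt                     # read the whole stack; note what it is doing and why
(gdb) continue               # resume, then repeat for a handful of samples

A few samples that land in the same avoidable activity are usually enough to tell you where the time is going.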

Mike Dunlavey
  • I already have the code up and running and producing the expected results, and I have done my own measurements using `std::chrono` to find out which global algorithm sub-sections cause the biggest bottlenecks. I just want to see how much I will gain when I optimize certain sub-algorithms. The problem is definitely in two algorithms that I coded myself, so basically I'm interested in quantifying how much faster the new version of the method will be. Thanks for the alternative method, I will check it out. – tmaric Feb 15 '14 at 08:49
  • I've read all your posts I could find online on 'random pausing', and I think I get the difference between the profiler's measurements and examining the stack myself. In one post you mentioned that 10 samples should be enough to find the culprit (the part of the code where I landed multiple times) in the current optimization cycle - is this coming from your experience? – tmaric Feb 20 '14 at 11:42
  • 1
    @tomislav-maric: I have been a pest on this subject :-). I've been using this method since before profilers even existed, and it continues to amaze me that it is not general knowledge. The common assumption seems to be that any tool has got to be better than what a human can do, and sure it's true in other areas, but not in profiling. The idea that measurement accuracy is necessary (therefore a large number of samples) has no foundation whatever, and the price being paid for it is missing possible speedups. Until a profiler can understand code as well as a person can, this will be so. – Mike Dunlavey Feb 20 '14 at 15:30
  • :) It was a great read! The usual comments on your posts read (at least to me) as 'this seems hard', probably because your approach is to treat 'bottlenecks' as bugs, and switching to hunting 'performance bugs' right after the code has just started to produce expected results can be psychologically difficult, I guess - it's easier to run valgrind and switch a data structure. From what I have picked up, you do not exit the debugger until your 10-20 samples stop landing in repeated instructions, right? – tmaric Feb 20 '14 at 15:45
  • 1
    @tomislav-maric: It might not be repeated instructions. e.g. It might be that `new` is on the stack from various places, or that several different container classes are trying to grow. These are things a profiler could not tie together, but you see it immediately. If a problem costs, say, 10% of time, then the average number of samples you need to see it twice is 2/0.1 = 20. Typically problems are bigger than that, so it takes fewer samples to see them twice. Something you see only once may be interesting, but you need to see it two or more times to be sure it's worth fixing. – Mike Dunlavey Feb 20 '14 at 15:53
  • 1
    thanks a lot for the tips! Actually, I am working on a code that does geometrical calculations, and the data structures change sizes often. The only way I had thought about this so far was to switch data structures and see which ones behave best for a set of test cases. Using your method I might be able to understand under what conditions this happens, and think about a better container expansion policy. – tmaric Feb 20 '14 at 16:03
  • @tomislav-maric: I realized I didn't answer your question. Each sample says the program is doing something, of some arbitrary description, and typically every sample will be different. When you stop is when you've seen at least two of them, not necessarily in sequence, that are doing something similar, that you can think of a way to improve. The larger the speedup opportunity is, the fewer samples it takes you to see it twice. Like if you take two samples, and you see it doing X on both of them, you know it is spending most of its time doing X. More samples will give more certainty. – Mike Dunlavey Feb 22 '14 at 16:44