4

I am working on a (quite large) existing single-threaded C application. In this context I modified the application to do a small amount of additional work: incrementing a counter each time we call a particular function (this function is called ~80,000 times). The application is compiled on Ubuntu 12.04 running a 64-bit Linux kernel 3.2.0-31-generic, with the -O3 option.

Surprisingly, the instrumented version of the code runs faster, and I am investigating why. I measure execution time with clock_gettime(CLOCK_PROCESS_CPUTIME_ID) and, to get representative results, I report the average execution time over 100 runs. Moreover, to avoid interference from the outside world, I tried as much as possible to launch the application on a system with no other applications running (on a side note, because CLOCK_PROCESS_CPUTIME_ID returns process time and not wall-clock time, other applications "should" in theory only affect the caches and not the process execution time directly).

I suspected "instruction cache effects": maybe the instrumented code, which is a little larger (a few bytes), fits differently and better in the cache. Is this hypothesis conceivable? I tried to do some cache investigation with valgrind --tool=cachegrind, but unfortunately the instrumented version has (as seems logical) more cache misses than the initial version.

Any hints on this subject, and any ideas that may help find why the instrumented code runs faster, are welcome (some GCC optimization available in one case and not in the other, why?, ...).

Manuel Selva
  • 16,987
  • 21
  • 76
  • 127
  • Without knowing any of your code, it becomes difficult to give a definite answer to your question. – fuz Oct 01 '12 at 10:02
  • 2
    @FUZxxl: I think that's why the question is, "what factors can I consider in doing my work", instead of the usual SO format, "please do my work for me" ;-) – Steve Jessop Oct 01 '12 at 10:05
  • @FUZxxl As stated by Steve Jessop, it's difficult to provide code here because it's quite big, and I am asking for help about the directions to take, not for an answer saying "the problem is here on line 345 of file mem.c". If you have any hint, please feel free to add an answer. – Manuel Selva Oct 01 '12 at 11:59
  • @ArjunShankar ~5% for an execution of ~7 seconds. – Manuel Selva Oct 01 '12 at 12:00
  • Have you peeked at the generated code for both versions (i.e. `gcc -S`)? Also: Cachegrind, IIRC, 'simulates' a cache, so I'm not sure how well it represents a real run. Really weird thought: How small is the function? What if it normally got inlined, and your counter increment made it *just* big enough for GCC not to inline it? Hence giving you *smaller* code size overall? – ArjunShankar Oct 01 '12 at 12:17
  • @ArjunShankar Thanks for the suggestions. Yes, I looked at the generated assembler files and didn't notice anything "particular". Yes, cachegrind performs a simulation, but I was not able to find any other tool for looking at cache misses (see related question here http://stackoverflow.com/questions/12601474/what-are-perf-cache-events-meaning). – Manuel Selva Oct 01 '12 at 12:29
  • @ArjunShankar I am not sure I understand your last point. The function is really small, so I guess it's inlined in both cases, but I am going to check that. Nevertheless, if adding the counter increment results in GCC no longer inlining the function, how can this explain the performance increase (the purpose of inlining is to improve execution performance)? – Manuel Selva Oct 01 '12 at 12:32
  • 1
    @ManuelSelva - [Inlining isn't always a win](http://en.wikipedia.org/wiki/Inline_expansion). Sometimes it reduces performance. Anyway, my comment was mostly wild guesswork. – ArjunShankar Oct 01 '12 at 12:38
  • @ManuelSelva - How about some more info, or a follow-up on the inlining stuff? Or both? – ArjunShankar Oct 03 '12 at 11:54

2 Answers

4

Since there are not many details in the question, I can only recommend some factors to consider while investigating the problem.

Even a small amount of additional work (such as incrementing a counter) can alter the compiler's decision on whether to apply some optimizations or not. The compiler does not always have enough information to make the perfect choice. It may try to optimize for speed where the bottleneck is code size. It may try to auto-vectorize computations when there is not much data to process. The compiler may not know what kind of data is to be processed, or the exact model of CPU that will execute the code.

  1. Incrementing a counter may increase the size of some loop and prevent loop unrolling. This may decrease code size (and improve code locality, which is good for instruction or micro-op caches and for the loop buffer, and allows the CPU to fetch/decode instructions quickly).
  2. Incrementing a counter may increase the size of some function and prevent inlining. This also may decrease overall code size.
  3. Incrementing a counter may prevent auto-vectorization, which again may decrease code size.

Even if this change does not affect compiler optimization, it may alter the way the code is executed by the CPU.

  1. If you insert the counter-incrementing code in a place full of branch targets, it may make the branch targets less dense and improve branch prediction.
  2. If you insert the counter-incrementing code in front of some particular branch target, it may make the branch target's address better aligned and make code fetch faster.
  3. If you place the counter-incrementing code after some data is written but before the same data is loaded again (and store-to-load forwarding did not work for some reason), the load may complete earlier.
  4. Inserting the counter-incrementing code may prevent two conflicting loads from hitting the same bank of the L1 data cache.
  5. Inserting the counter-incrementing code may alter some CPU scheduler decision and make some execution port available just in time for a performance-critical instruction.

To investigate the effects of compiler optimization, compare the generated assembler code before and after adding the counter-incrementing code.

To investigate CPU effects, use a profiler that can read the processor's hardware performance counters (e.g. perf).

Evgeny Kluev
  • 23,617
  • 7
  • 50
  • 90
1

Just guessing from my experience with embedded compilers: optimization passes in compilers look for recursive tasks. Perhaps the additional code forced the compiler to see something more recursive, and it structured the machine code differently. Compilers do some weird things for optimization. In some languages (Perl, I think?) a "not not" conditional is faster to execute than a "true" conditional. Does your debugging tool allow you to single-step through a code/assembly comparison? This could add some insight into what the compiler decided to do with the extra tasks.

Jeremy
  • 113
  • 12