
While testing a custom heap manager (intended to replace the system one), I encountered some slowdowns compared to the system heap.

I used AMD CodeAnalyst to profile an x64 application on Windows 7 (Intel Xeon CPU E5-1620 v2 @ 3.70 GHz) and got the following results:

Profile results

This block consumes about 90% of the whole application's run time. A lot of time is attributed to `cmp [rsp+18h], rax` and `test eax, eax`, but no time at all to the jumps right below those compares. Is it normal that the jumps take no time? Is this because of the branch prediction mechanism?

I inverted the condition, and here is what I got (the absolute numbers differ a bit because I stopped the profiling sessions manually, but a lot of time is still taken by the compares):

These compares are executed so many times that they become a bottleneck. That is how I interpret the results. And probably the best optimization would be to rework the algorithm, right?

greenpiece
  • I have managed to improve the algorithm; the custom heap is now as fast as the system one in the release config. The question remains, though: why do the compares take time while the jumps take none? – greenpiece May 03 '16 at 17:59
  • You're writing a heap manager so to test it you must be loading it heavily. You're clearly looking at the inner loop. If you want it to be fast, use overall elapsed time to measure speedup. What the percent is for is to tell you where to look for speedups. If you push down the percent in one piece of code, you raise it up in another, because it still has to add up to 100%. So sure, it could be a matter of cache misses and branch prediction. That's the CPU engineers trying to save you time. But you can look at this and say "can I do this fewer times?" That's how to save time. – Mike Dunlavey May 06 '16 at 21:30

1 Answer


Intel and AMD CPUs both macro-fuse `cmp`/`jcc` pairs into a single compare-and-branch uop (Intel) or macro-op (AMD). Intel SnB-family CPUs like yours can do this even with some instructions that also write an output register, like `and`, `sub`/`add`, and `inc`/`dec`.

To really understand profiling data, you have to understand something about how the out-of-order pipeline works in the microarchitecture you're tuning for. See the links at the tag wiki, especially Agner Fog's microarch PDF.

You should also beware that profiling cycle counts can get charged to the instruction that's waiting for results, not the instruction that is slow to produce them.

Peter Cordes