Calculating the speed of routines?

Question

What would be the best and most accurate way to determine how long it took to process a routine, such as a procedure of function?

I ask because I am currently trying to optimize a few functions in my Application, when i test the changes it is hard to determine just by looking at it if there was any improvements at all. So if I could return an accurate or near accurate time it took to process a routine, I then have a more clear idea of how well, if any changes to the code have been made.

I considered using GetTickCount, but I am unsure if this would be anything near accurate?

It would be useful to have a resuable function/procedure to calculate the time of a routine, and use it something like this:

// < prepare for calcuation of code
...
ExecuteSomeCode; // < code to test
...
// < stop calcuating code and return time it took to process

I look forward to hearing some suggestions.

Thanks.

Craig.

score 24 · Accepted Answer · answered May 17 '11 at 12:26

24

From my knowledge, the most accurate method is by using QueryPerformanceFrequency:

code:

var
  Freq, StartCount, StopCount: Int64;
  TimingSeconds: real;
begin
  QueryPerformanceFrequency(Freq);
  QueryPerformanceCounter(StartCount);
  // Execute process that you want to time: ...
  QueryPerformanceCounter(StopCount);
  TimingSeconds := (StopCount - StartCount) / Freq;
  // Display timing: ... 
end;

answered May 17 '11 at 12:26

RBA

11,762
14
72
118

I tried like so: 'procedure TForm1.Button1Click(Sender: TObject); var i: Integer; begin QueryPerformanceFrequency(Freq); QueryPerformanceCounter(StartCount); for i:= 0 to 10000 do begin ListBox1.Items.Add('Item ' + IntToStr(i)); end; QueryPerformanceCounter(StopCount); TimingSeconds := (StopCount - StartCount) / Freq; ShowMessage(IntToStr(Round((TimingSeconds)))); end;' It returns as in seconds, how to return this in Milliseconds? – May 17 '11 at 12:32
here you have a fully functional example - http://delphi.about.com/od/windowsshellapi/a/delphi-high-performance-timer-tstopwatch.htm – RBA May 17 '11 at 12:42
4

Convert seconds to milliseconds by dividing by 1000 – David Heffernan May 17 '11 at 12:43
1

@David: You mean *multiplying*, don't you? – Ken White May 17 '11 at 12:49
@Ken Yes, that's what I meant! – David Heffernan May 17 '11 at 12:52
1

@Ken, @David, dividing the divider before the division might be a better idea! That is, `Freq := Freq div 1000` before `TimingMilliseconds := (Stop - Start) / Freq`. – Cosmin Prund May 17 '11 at 12:58
1

oh dear, why I didnt I think of doing this! – May 17 '11 at 13:17
4

@RBA, the overhead when using QueryPerformanceCount is 19 microseconds per call according to: http://support.microsoft.com/kb/172338. If you use RDTSC the overhead is a few cycles of the CPU. On short routines, the overhead of QueryPerformanceCounter can be much larger than the total running time of your code. – Johan May 17 '11 at 13:27
1

If you use this routine, be sure to time the overhead involved in using it, and subtract that from the running time of your routine to get **accurate** results. – Johan May 17 '11 at 13:29
thanks for the extra information, I will be sure to consider that when determining the results. – May 17 '11 at 13:43
1

@Johan RDTSC is NOT reliable on modern CPU with multi-core and changing frequency. See http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx – Arnaud Bouchez May 17 '11 at 15:46
1

@A.Bouchez, It is not reliable for **time** but when profiling you are not interested in **time** you are interested in **CPU cycles**. `QueryPerformanceCounter` is **not accurate** in CPU cycles and therefore not suitable for profiling, see the edit in my answer, and the link to the official documentation from Intel. – Johan May 17 '11 at 17:20
@Cosmin: I was only commenting on David's comment, not the answer. :) – Ken White May 20 '11 at 00:27
@Johan RDTSC alone is not accurate on cycles as well: as soon as you have memory latencies thrown in the mix, the frequency changes will have varying impacts on main memory and CPU. The only accurate way to measure things is to turn off CPU frequency changes while benchamrking. – Eric Grange Jun 07 '13 at 15:54

score 18 · Answer 2 · answered May 17 '11 at 12:20

18

Try Eric Grange's Sampling Profiler.

answered May 17 '11 at 12:20

Ondrej Kelle

36,175
2
60
122

+1 I'd always prefer to use a profiler than roll my own routines. However, try and measure with your wristwatch once you've done your optimisation to take a reality check! – David Heffernan May 17 '11 at 12:57
I read the blurb. I can't tell from the wording if it 1) samples the call stack, 2) reports by line the percent of samples containing that line, 3) samples on wall-clock time. Do you know if it does? – Mike Dunlavey May 17 '11 at 13:19
To measure exactly a single routine a sampling profiler may not be the best choice, it is better when you have to understand where a larger code is slow and then drill down. When it comes to relatively small pieces of code a classic profiler would usually return more precise informations. – May 17 '11 at 13:34
@Mike: I don't know how it works internally and I'm not sure I even understand what you're asking but I think it regularly examines which code is currently executing, and keeps a hit count for every called function/method. It also analyses the call stack and keeps stats for all callers. – Ondrej Kelle May 17 '11 at 13:45
@ldsandon you're right, if it's about precise measurement my answer is not helpful. The profiler is better at showing you *which* code is executed often so improving it will have a greater impact. – Ondrej Kelle May 17 '11 at 13:49
Put another way: Suppose a particular line of code foo.cpp:37 calls a function, like `new`, and is on the stack 40% of the time, does the profiler assign 40% to that line? Suppose another line does `print` or `cout::< – Mike Dunlavey May 18 '11 at 13:28

score 14 · Answer 3 · edited Jun 20 '20 at 09:12

From Delphi 6 upwards you can use the x86 Timestamp counter.
This counts CPU cycles, on a 1 Ghz processor, each count takes one nanosecond.
Can't get more accurate than that.

function RDTSC: Int64; assembler;
asm
  // RDTSC can be executed out of order, so the pipeline needs to be flushed
  // to prevent RDTSC from executing before your code is finished.  
  // Flush the pipeline
  XOR eax, eax
  PUSH EBX
  CPUID
  POP EBX
  RDTSC  //Get the CPU's time stamp counter.
end;

On x64 the following code is more accurate, because it does not suffer from the delay of CPUID.

  rdtscp        // On x64 we can use the serializing version of RDTSC
  push rbx      // Serialize the code after, to avoid OoO sneaking in
  push rax      // subsequent instructions prior to executing RDTSCP.
  push rdx      // See: http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
  xor eax,eax
  cpuid
  pop rdx
  pop rax
  pop rbx
  shl rdx,32
  or rax,rdx

Use the above code to get the timestamp before and after executing your code.
Most accurate method possible and easy as pie.

Note that you need to run a test at least 10 times to get a good result, on the first pass the cache will be cold, and random harddisk reads and interrupts can throw off your timings.
Because this thing is so accurate it can give you the wrong idea if you only time the first run.

Why you should not use QueryPerformanceCounter()
QueryPerformanceCounter() gives the same amount of time if the CPU slows down, it compensates for CPU thottling. Whilst RDTSC will give you the same amount of cycles if your CPU slows down due to overheating or whatnot.
So if your CPU starts running hot and needs to throttle down, QueryPerformanceCounter() will say that your routine is taking more time (which is misleading) and RDTSC will say that it takes the same amount of cycles (which is accurate).
This is what you want because you're interested in the amount of CPU-cycles your code uses, not the wall-clock time.

From the lastest intel docs: http://software.intel.com/en-us/articles/measure-code-sections-using-the-enhanced-timer/?wapkw=%28rdtsc%29

Using the Processor Clocks

This timer is very accurate. On a system with a 3GHz processor, this timer can measure events that last less than one nanosecond. [...] If the frequency changes while the targeted code is running, the final reading will be redundant since the initial and final readings were not taken using the same clock frequency. The number of clock ticks that occurred during this time will be accurate, but the elapsed time will be an unknown.

When not to use RDTSC
RDTSC is useful for basic timing. If you're timing multithreaded code on a single CPU machine, RDTSC will work fine. If you have multiple CPU's the startcount may come from one CPU and the endcount from another.
So don't use RDTSC to time multithreaded code on a multi-CPU machine. On a single CPU machine it works fine, or single threaded code on a multi-CPU machine it is also fine.
Also remember that RDTSC counts CPU cycles. If there is something that takes time but doesn't use the CPU, like disk-IO or network than RDTSC is not a good tool.

But the documentation says RDTSC is not accurate on modern CPU's
RDTSC is not a tool for keeping track of time, it's a tool for keeping track of CPU-cycles.
For that it is the only tool that is accurate. Routines that keep track of time are not accurate on modern CPU's because the CPU-clock is not absolute like it used to be.

RDTSC is not to be used on Windows with modern CPU. In fact, its results depends on the current CPU state: modern CPUs can change its frequency (e.g. TurboBoost technology), and multi-core design makes it even more difficult to get accurate results, therefore rdtsc is NOT to be used. QueryPerformanceCounter() IS to be used instead. See http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx — Arnaud Bouchez, May 17 '11 at 15:45
@Bouchez, this is **exactly** why you want to use RDTSC! If your processor slows down, RDTSC will still give the correct running time in cycles. As for multithreading issues, standard Delphi code is single threaded so that issue only applies if you use multiple threads. It is `QueryPerformanceCounter()` that **does not give accurate** results when the CPU slows down. — Johan, May 17 '11 at 16:50
@Johan Delphi applications, especially those who need profiling, are commonly multithreaded (e.g. for a server). About QueryPerformanceCounter, the page you quote tends to say exactly the contrary: RDTSC is not accurate in case of a CPU change during thread execution, whereas Etime is accurate because it checks for the CPU it runs on. And overhead is NOT an issue during profiling. You don't want you app to be as fast as usual, you want to know what takes time, accurately. — Arnaud Bouchez, May 17 '11 at 20:35
@Bouchez, the OP is not talking about servers or threads. He wants to know how fast his routine is. In that case you want to know the cycles, not the wall-clock time. He want to know if the changes he's making to the routine is making things faster or slower. If your testing function has a lot of overhead, this messes up your timings. Thats why RDTSC is so great there's very little overhead and it counts the cycles exactly, not like `QueryPerformanceCounter()`, which can fluctuate a lot depending on if your CPU is running hot or not. — Johan, May 17 '11 at 22:00
@Johan Who on earth has a single core dev PC today? What's the profit of running the application with the thread affinity set to ONE CPU just for using RDTSC? The `QueryPerformanceCounter` overhead is NOT a problem, because you'll compute diff of timestamps. And the doc you're quoting is not "latest" doc. The date on top of page is the date of the page layour refresh. This Intel doc deals with Pentium II and refers to a SpeedStep techo dated from 2003 !!!!! — Arnaud Bouchez, May 18 '11 at 05:57
If you are using CPUID, you will have to PUSH EBX before executing it and POP EBX afterwards. This is because CPUID (also) modifies the EBX register, but Delphi expects ASM statments/routines to preserve EBX. — PhiS, May 18 '11 at 07:06
@A.Bouchez - you can set the affinity of the executing thread to one logical processor, then you make sure you get comparable readings from the Time Stamp Counter using this method. — PhiS, May 18 '11 at 07:09
@Bouchez, You are missing the point. `QueryPerformanceCounter` is make for measuring wall-clock time. RDTSC is made for measuring CPU-ticks. For use of RDTSC on multi-core CPU's see: http://stackoverflow.com/questions/3388134/rdtsc-accuracy-across-cpu-cores The `QueryPerformanceCounter` overhead **is** a big problem because it varies between different runs. — Johan, May 18 '11 at 07:13
RDTSC doesn't shield you from memory latencies effects when CPU frequency changes. In addition to adjusting the thread affinity, you also need to turn off frequency adjustements (maximum performance mode in Windows) — Eric Grange, Jun 07 '13 at 15:57
Thanks a lot, this is just what I need to time cache/memory access. QueryPerformanceCounter was too slow for this, all it did was show the 15 to 17 ticks which probably measuring itself in microseconds, the next few calls to itself are probably predicted/cached so it went down to 7 ticks, but basically because of the microseconds overhead it was completely meanless/unusuable timing information ;) All it does is show 7 microseconds for everything is my best guess ! ;) :) perhaps it's a bit different based on it's frequency but it's somewhere in that ballpark ;) — Skybuck Flying, Jul 29 '18 at 23:45

score 10 · Answer 4 · answered May 17 '11 at 13:27

10

You didn't specify your Delphi version, but Delphi XE has a TStopWatch declared in unit Diagnostics. This will allow you to measure the runtime with reasonable precision.

uses
  Diagnostics;
var
  sw: TStopWatch;
begin
  sw := TStopWatch.StartNew;
  <dosomething>
  Writeln(Format('runtime: %d ms', [sw.ElapsedMilliseconds]));
end;

answered May 17 '11 at 13:27

Uwe Raabe

39,888
3
77
115

1

I don't have Delphi XE, but I believe that is the same class as here http://delphi.about.com/od/windowsshellapi/a/delphi-high-performance-timer-tstopwatch.htm – RBA May 17 '11 at 13:31
IN XE I would profile with AQ Time to understand exactly where performance matters. – May 17 '11 at 13:32
@RBA, the XE implementation is based on a record instead of a class, but the other stuff looks quite similar. – Uwe Raabe May 17 '11 at 15:51
1

the 'T' from the TStopWatch suggested me that is a class. good to know how is implemented in XE. – RBA May 18 '11 at 06:20
1

Why? The standard Delphi usage of 'T' is to indicate "Type", as in "this is a data type". Nearly all types start with a T but many of them are not classes. See TRect and TDateTime just to name a couple. – Jon Robertson Jan 23 '13 at 10:27

score 6 · Answer 5 · edited May 23 '17 at 12:17

6

I ask because I am currently trying to optimize a few functions

It is natural to think that measuring is how you find out what to optimize, but there's a better way.

If something takes a large enough fraction of time (F) to be worth optimizing, then if you simply pause it at random, F is the probability you will catch it in the act. Do that several times, and you will see precisely why it's doing it, down to the exact lines of code.

Calculating the speed of routines?

8 Answers8

Linked