From Delphi 6 upwards you can use the x86 Timestamp counter.
This counts CPU cycles; on a 1 GHz processor, each count takes one nanosecond.
Can't get more accurate than that.
function RDTSC: Int64; assembler;
asm
// RDTSC can be executed out of order, so the pipeline needs to be flushed
// to prevent RDTSC from executing before your code is finished.
// Flush the pipeline
XOR eax, eax
PUSH EBX
CPUID
POP EBX
RDTSC //Get the CPU's time stamp counter.
end;
On x64 the following code is more accurate, because the counter is read before CPUID runs, so the delay of CPUID does not end up in the reading.
rdtscp // RDTSCP waits for all earlier instructions to finish before it reads the counter
push rbx // CPUID below serializes the code after, so out-of-order execution cannot
push rax // start subsequent instructions before RDTSCP has executed.
push rdx // See: http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
xor eax,eax
cpuid
pop rdx
pop rax
pop rbx
shl rdx,32
or rax,rdx
Use the above code to get the timestamp before and after executing your code.
Most accurate method possible and easy as pie.
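As an illustration of the before/after pattern, here is a minimal sketch for a 32-bit build. The RDTSC function is repeated from above so the program stands alone; SumTo is a hypothetical workload that exists only to give the timer something to measure.

```pascal
program RdtscUsage;
{$APPTYPE CONSOLE}

// 32-bit counter read, repeated from the answer so this sketch stands alone.
// Build for a 32-bit target; the x64 variant needs different registers.
function RDTSC: Int64; assembler;
asm
  XOR EAX, EAX
  PUSH EBX
  CPUID          // serialize: earlier instructions finish first
  POP EBX
  RDTSC          // counter lands in EDX:EAX, where Delphi expects an Int64 result
end;

// Hypothetical workload, only here to have something to time.
function SumTo(N: Integer): Int64;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to N do
    Inc(Result, I);
end;

var
  StartCount, EndCount, Total: Int64;
begin
  StartCount := RDTSC;
  Total := SumTo(1000000);
  EndCount := RDTSC;
  Writeln('Result: ', Total);
  Writeln('Cycles: ', EndCount - StartCount);
end.
```

Printing the workload's result keeps the compiler from treating the call as dead code.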
Note that you need to run a test at least 10 times to get a good result. On the first pass the cache will be cold, and random hard disk reads and interrupts can throw off your timings.
Because this thing is so accurate it can give you the wrong idea if you only time the first run.
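One possible way to summarize repeated runs is to drop the first (cold-cache) sample and keep the smallest of the rest, since interrupts and disk activity can only add cycles. This is a sketch of that idea, not the only reasonable choice (a median works too). BestAfterWarmup is a hypothetical helper and the cycle counts are made-up numbers for illustration.

```pascal
program RepeatedRuns;
{$APPTYPE CONSOLE}

// Hypothetical helper: skip the first (cold-cache) sample and return the
// smallest of the remaining cycle counts. Needs at least two samples.
function BestAfterWarmup(const Samples: array of Int64): Int64;
var
  I: Integer;
begin
  Result := Samples[1];              // open arrays are zero-based, so [1] is the second run
  for I := 2 to High(Samples) do
    if Samples[I] < Result then
      Result := Samples[I];
end;

const
  // Made-up cycle counts from ten runs: the first is slow because the cache
  // is cold, the fifth was hit by an interrupt.
  Runs: array[0..9] of Int64 =
    (5200, 1010, 1004, 1003, 1950, 1002, 1004, 1003, 1002, 1005);
begin
  Writeln('Best warm run: ', BestAfterWarmup(Runs), ' cycles');
end.
```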
Why you should not use QueryPerformanceCounter()
QueryPerformanceCounter() measures wall-clock time: it compensates for CPU throttling, so its ticks stay the same length when the CPU slows down. RDTSC, by contrast, will give you the same number of cycles if your CPU slows down due to overheating or whatnot.
So if your CPU starts running hot and needs to throttle down, QueryPerformanceCounter() will say that your routine is taking more time (which is misleading) and RDTSC will say that it takes the same number of cycles (which is accurate).
This is what you want because you're interested in the amount of CPU-cycles your code uses, not the wall-clock time.
From the latest Intel docs: http://software.intel.com/en-us/articles/measure-code-sections-using-the-enhanced-timer/?wapkw=%28rdtsc%29
Using the Processor Clocks
This timer is very accurate. On a system with a 3GHz processor, this timer can measure events that last less than one nanosecond. [...] If the frequency changes while the targeted code is running, the final reading will be redundant since the initial and final readings were not taken using the same clock frequency. The number of clock ticks that occurred during this time will be accurate, but the elapsed time will be an unknown.
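The relationship between ticks and elapsed time in the quote is a plain division, and it only holds while the clock frequency stays fixed for the whole measurement. A small sketch of that arithmetic follows. CyclesToNanoseconds is a hypothetical helper, and the frequency has to be supplied by the caller because the cycle count alone does not reveal it.

```pascal
program CyclesToTime;
{$APPTYPE CONSOLE}

// Hypothetical helper: convert a cycle count to nanoseconds, assuming the
// clock ran at ClockHz for the entire measurement.
function CyclesToNanoseconds(Cycles: Int64; ClockHz: Double): Double;
begin
  Result := Cycles / ClockHz * 1e9;
end;

begin
  // The same 3000 cycles correspond to different elapsed times at different
  // clock speeds: 1000 ns at 3 GHz, 3000 ns at 1 GHz.
  Writeln(CyclesToNanoseconds(3000, 3.0e9):0:1, ' ns at 3 GHz');
  Writeln(CyclesToNanoseconds(3000, 1.0e9):0:1, ' ns at 1 GHz');
end.
```

If the frequency changed partway through, neither figure is right, which is why the cycle count itself is the number to compare between runs.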
When not to use RDTSC
RDTSC is useful for basic timing. Timing multithreaded code on a single-CPU machine works fine, and so does timing single-threaded code on a multi-CPU machine. With multiple threads on multiple CPUs, however, the start count may come from one CPU and the end count from another.
So don't use RDTSC to time multithreaded code on a multi-CPU machine.
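If a measurement has to run on a multi-CPU machine anyway, one common workaround (not part of the advice above) is to pin the measuring thread to a single CPU while timing, so that both readings come from the same counter. A minimal sketch, assuming Windows and its SetThreadAffinityMask API; the choice of CPU 0 is arbitrary.

```pascal
program PinThreadForTiming;
{$APPTYPE CONSOLE}

uses
  Winapi.Windows;  // plain "Windows" in Delphi versions before XE2

var
  OldMask: DWORD_PTR;
begin
  // Restrict the current thread to CPU 0 (bit 0 of the affinity mask).
  // The call returns the previous mask, or 0 on failure.
  OldMask := SetThreadAffinityMask(GetCurrentThread, 1);
  if OldMask = 0 then
    Writeln('SetThreadAffinityMask failed, error ', GetLastError)
  else
  try
    // Read the counter, run the code under test, read the counter again.
    Writeln('Pinned to CPU 0; take both readings here.');
  finally
    // Put the original affinity back so the rest of the program is unaffected.
    SetThreadAffinityMask(GetCurrentThread, OldMask);
  end;
end.
```

This only helps for the thread doing the measuring; worker threads started by the code under test can still run on other CPUs.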
Also remember that RDTSC counts CPU cycles. If something takes time but doesn't use the CPU, like disk I/O or network access, then RDTSC is not a good tool.
But the documentation says RDTSC is not accurate on modern CPUs
RDTSC is not a tool for keeping track of time; it's a tool for keeping track of CPU cycles.
For that it is the only tool that is accurate. Routines that keep track of time are not accurate on modern CPUs because the CPU clock is not absolute like it used to be.