You have several misconceptions about performance. You need to dispel these misconceptions.
Now I can test the speed of those methods in my main.cpp: (…)
Your benchmarking code calls the benchmarked functions directly. So you're measuring the benchmarked functions as optimized for the specific case of how they're used by the benchmarking code: to call them repeatedly on the same input. This is unlikely to have any relevance to how they behave in a realistic environment.
I think the compiler didn't do anything earth-shattering because it doesn't know what toupper
does. If the compiler had known that toupper
doesn't transform a nonzero character into zero, it might well have hoisted the strlen
call outside the benchmarked loop. And if it had known that toupper(toupper(x)) == toupper(x)
, it might well have decided to run the loop only once.
To make a somewhat realistic benchmark, put the benchmarked code and the benchmarking code in separate source files, compile them separately, and disable any kind of cross-module or link-time optimization.
Then I can compile and test with cmake in Debug or Release
Compiling in debug mode rarely has any relevance to microbenchmarks (benchmarking the speed of an implementation of a small fragment of code, as opposed to benchmarking the relative speed of algorithms in terms of how many elementary functions they call). Compiler optimizations have a significant effect on microbenchmarks.
So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc ...).
No, absolutely not.
First of all, fewer instructions total is completely irrelevant to the speed of the program. Even on a platform where executing one instruction takes the same amount of time regardless of what the instruction is, which is unusual, what matters is how many instructions are executed, not how many instructions there are in the program. For example, a loop with 100 instructions that is executed 10 times is 10 times faster than a loop with 10 instructions that is executed 1000 times, even though it's 10 times larger. Inlining is a common program transformation that usually makes the code larger and makes it faster often enough that it's considered a common optimization.
Second, on many platforms, such as any PC or server made in the 21st century, any smartphone, and even many lower-end devices, the time it takes to execute an instruction can vary so widely that it's a poor indication of performance. Cache is a major factor: a read from memory can be more than 1000 times slower than a read from cache on a PC. Other factors with less impact include pipelining, which causes the speed of an instruction to depend on the surrounding instructions, and branch prediction, which causes the speed of a conditional instruction to depend on the outcome of previous conditional instructions.
Third, that's just considering processor instructions — what you see in assembly code. Compilers for C, C++ and most other languages optimize programs in such a way that it can be hard to predict what the processor will be doing exactly.
For example, how long does the instruction ++x;
take on a PC?
- If the compiler has figured out that the addition is unnecessary, for example because nothing uses
x
afterwards, or because the value of x
is known at compile time and therefore so is the value of x+1
, it'll optimize it away. So the answer is 0.
- If the value of
x
is already in a register at this point and the value is only needed in a register afterwards, the compiler just needs to generate an addition or increment instruction. So the simplistic, but not quite correct answer is 1 clock cycle. One reason this is not quite correct is that merely decoding the instruction takes many cycles on a high-end processor such as what you find in a 21st century PC or smartphone. However “one cycle” is kind of correct in that while it takes multiple clock cycles from starting the instruction to finishing it, the instruction only takes one cycle in each pipeline stage. Furthermore, even taking this into account, another reason this is not quite correct is that ++x; ++y;
might not take 2 clock cycles: modern processors are sophisticated enough that they may be able to decode and execute multiple instructions in parallel (for example, a processor with 4 arithmetic units can perform 4 additions at the same time). Yet another reason this might not be correct is if the type of x
is larger or smaller than a register, which might require more than one assembly instruction to perform the addition.
- If the value of
x
needs to be loaded from memory, this takes a lot more than one clock cycle. Anything other than the innermost cache level dwarfs the time it takes to decode the instruction and perform the addition. The amount of time is very different depending on whether x
is found in the L3 cache, in the L2 cache, in the L1 cache, or in the “real” RAM. And even that gets more complicated when you consider that x
might be part of a cache prefetch (hardware- or software- triggered).
- It's even possible that
x
is currently in swap, so that reading it requires reading from a disk.
- And writing the result exhibits somewhat similar variations to reading the input. However the performance characteristics are different for reads and for writes because when you need a value, you need to wait for the read to be complete, whereas when you write a value, you don't need to wait for the write to be complete: a write to memory writes to a buffer in cache, and the time when the buffer is flushed to a higher-level cache or to RAM depends on what else is happening on the system (what else is competing for space in the cache).
Ok, now let's turn to your specific example and look at what happens in their inner loop. I'm not very familiar with x86 assembly but I think I get the gist.
For stringUpperStrlen
, the inner loop starts at .L4
. Just before entering the inner loop, %rbx
is set to the length of the string. Here's what the inner loop contains:
cmpq %rbp, %rbx
: Compare the current index to the length, both obtained from registers.
je .L3
: conditional jump, to exit the loop if the index is equal to the length.
movq 0(%r13), %r12
: Read from memory to get the address of the beginning of the string. (I'm surprised that the address isn't in a register at this point.)
addq %rbp, %r12
: an arithmetic operation that depends on the value that was just read from memory.
movsbl (%r12), %edi
: Read the current character from the string in memory.
incq %rbp
: Increment the index. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
call toupper@PLT
movb %al, (%r12)
: Write the value returned by the function to the current character of the string in memory.
jmp .L4
: Unconditional jump to the beginning of the loop.
For stringUpperPtr
, the inner loop starts at .L9
. Here's what the inner loop contains:
movsbl (%rbx), %edi
: read from the address containing the current.
testb %dil, %dil
: test if %dil
is zero. %dil
is the least significant byte of %edi
which was just read from memory.
je .L8
: conditional jump, to exit the loop if the character is zero.
call toupper@PLT
movb %al, (%rbx)
: Write the value returned by the function to the current character of the string in memory.
incq %rbx
: Increment the pointer. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
jmp .L9
: Unconditional jump to the beginning of the loop.
The differences between the two loops are:
- The loops have slightly different lengths, but both are small enough that they fit in a single cache line (or two, if the code happens to straddle a line boundary). So after the first iteration of the loop, the code will be in the innermost instruction cache. Not only that, but if I understand correctly, on modern Intel processors, there is a cache of decoded instructions, which the loop is small enough to fit in, and so no decoding needs to take place.
- The
stringUpperStrlen
loop has one more read. The extra read is from a constant address which is likely to remain in the innermost cache after the first iteration.
- The conditional instruction in the
stringUpperStrlen
loop depends only on values that are in registers. On the other hand, the conditional instruction in the stringUpperPtr
loop depends on a value which was just read from memory.
So the difference boils down to an extra data read from the innermost cache, vs having a conditional instruction whose outcome depends on a memory read. An instruction whose outcome depends on the result of another instruction leads to a hazard: the second instruction is blocked until the first instruction is fully executed, which prevents taking advantage from pipelining, and can render speculative execution less effective. In the stringUpperStrlen
loop, the processor essentially runs two things in parallel: the load-call-store cycle, which doesn't have any conditional instructions (apart from what happens inside toupper
), and the increment-test cycle, which doesn't access memory. This lets the processor work on the conditional instruction while it's waiting for memory. In the stringUpperPtr
loop, the conditional instruction depends on a memory read, so the processor can't start working on it until the read is complete. I'd typically expect this to be slower than the extra read from the innermost cache, although it might depend on the processor.
Of course, the stringUpperStrlen
does need to have a load-test hazard to determine the end of the string: no matter how it does it, it needs to fetch characters in memory. This is hidden inside repnz scasb
. I don't know the internal architecture of an x86 processor, but I suspect that this case (which is extremely common since it's the meat of strlen
) is heavily optimized inside the processor, probably to an extent that is impossible to reach with generic instructions.
You may see different results if the string was longer and the two memory accesses in stringUpperStrlen
weren't in the same cache line, although possibly not because this only costs one more cache line and there are several. The details would depend on how the caches work and how toupper
uses them.