
While I was writing a class for strings in C++, I found strange behavior regarding execution speed. I'll take as an example the following two implementations of the upper method:

class String {

    char* str;

    ...

    forceinline void upperStrlen();
    forceinline void upperPtr();
};

void String::upperStrlen()
{
    INDEX length = strlen(str);

    for (INDEX i = 0; i < length; i++) {
        str[i] = toupper(str[i]);
    }
}

void String::upperPtr()
{
    char* ptr_char = str;

    for (; *ptr_char != '\0'; ptr_char++) {
        *ptr_char = toupper(*ptr_char);
    }
}

INDEX is simply a typedef of uint_fast32_t.
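
That is:

#include <cstdint>
typedef uint_fast32_t INDEX;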

Now I can test the speed of those methods in my main.cpp:

#define TEST_RECURSIVE(_function)                    \
{                                                    \
    bool ok = true;                                  \
    clock_t before = clock();                        \
    for (int i = 0; i < TEST_RECURSIVE_TIMES; i++) { \
        if (!(_function()) && ok)                    \
            ok = false;                              \
    }                                                \
    char output[TEST_RECURSIVE_OUTPUT_STR];          \
    sprintf(output, "[%s] Test %s %s: %ld ms\n",     \
        ok ? "OK" : "Failed",                        \
        TEST_RECURSIVE_BUILD_TYPE,                   \
        #_function,                                  \
        (clock() - before) * 1000 / CLOCKS_PER_SEC); \
    fprintf(stdout, output);                         \
    fprintf(file_log, output);                       \
}

String a;
String b;

bool stringUpperStrlen()
{
    a.upperStrlen();
    return true;
}

bool stringUpperPtr()
{
    b.upperPtr();
    return true;
}

int main(int argc, char** argv) {

    ...

    a = "Hello World!";
    b = "Hello World!";

    TEST_RECURSIVE(stringUpperPtr);
    TEST_RECURSIVE(stringUpperStrlen);

    ...

    return 0;
}

Then I can compile and test with cmake in Debug or Release with the following results.

[OK] Test RELEASE stringUpperPtr: 21 ms
[OK] Test RELEASE stringUpperStrlen: 12 ms

[OK] Test DEBUG stringUpperPtr: 27 ms
[OK] Test DEBUG stringUpperStrlen: 33 ms

So in Debug the behavior is what I expected: the pointer is faster than strlen. But in Release, strlen is faster.

So I looked at the GCC assembly, and there are far fewer instructions in stringUpperPtr than in stringUpperStrlen.

The stringUpperStrlen assembly:

_Z17stringUpperStrlenv:
.LFB72:
    .cfi_startproc
    pushq   %r13
    .cfi_def_cfa_offset 16
    .cfi_offset 13, -16
    xorl    %eax, %eax
    pushq   %r12
    .cfi_def_cfa_offset 24
    .cfi_offset 12, -24
    pushq   %rbp
    .cfi_def_cfa_offset 32
    .cfi_offset 6, -32
    xorl    %ebp, %ebp
    pushq   %rbx
    .cfi_def_cfa_offset 40
    .cfi_offset 3, -40
    pushq   %rcx
    .cfi_def_cfa_offset 48
    orq $-1, %rcx
    movq    a@GOTPCREL(%rip), %r13
    movq    0(%r13), %rdi
    repnz scasb
    movq    %rcx, %rdx
    notq    %rdx
    leaq    -1(%rdx), %rbx
.L4:
    cmpq    %rbp, %rbx
    je  .L3
    movq    0(%r13), %r12
    addq    %rbp, %r12
    movsbl  (%r12), %edi
    incq    %rbp
    call    toupper@PLT
    movb    %al, (%r12)
    jmp .L4
.L3:
    popq    %rdx
    .cfi_def_cfa_offset 40
    popq    %rbx
    .cfi_def_cfa_offset 32
    popq    %rbp
    .cfi_def_cfa_offset 24
    popq    %r12
    .cfi_def_cfa_offset 16
    movb    $1, %al
    popq    %r13
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc
.LFE72:
    .size   _Z17stringUpperStrlenv, .-_Z17stringUpperStrlenv
    .globl  _Z14stringUpperPtrv
    .type   _Z14stringUpperPtrv, @function

The stringUpperPtr assembly:

_Z14stringUpperPtrv:
.LFB73:
    .cfi_startproc
    pushq   %rbx
    .cfi_def_cfa_offset 16
    .cfi_offset 3, -16
    movq    b@GOTPCREL(%rip), %rax
    movq    (%rax), %rbx
.L9:
    movsbl  (%rbx), %edi
    testb   %dil, %dil
    je  .L8
    call    toupper@PLT
    movb    %al, (%rbx)
    incq    %rbx
    jmp .L9
.L8:
    movb    $1, %al
    popq    %rbx
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc
.LFE73:
    .size   _Z14stringUpperPtrv, .-_Z14stringUpperPtrv
    .section    .rodata.str1.1,"aMS",@progbits,1

So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc.).

So how do you explain this difference in performance?

Thanks in advance.

EDIT: CMake generates something like these commands to compile:

/bin/g++-8  -Os -DNDEBUG  -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o  -o xpp-tests  libxpp.so 
/bin/g++-8  -O3 -DNDEBUG  -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o  -o Release/xpp-tests  Release/libxpp.so 

# CMAKE generated file: DO NOT EDIT!
# Generated by "Unix Makefiles" Generator, CMake Version 3.16

# compile CXX with /bin/g++-8
CXX_FLAGS = -O3 -DNDEBUG   -Wall -pipe -fPIC -march=native -fno-strict-aliasing

CXX_DEFINES = -DPLATFORM_UNIX=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1

The TEST_RECURSIVE macro will call _function 1000000 times in my examples.

  • When looking at timings and code generated, please include the command lines used. Note that timings without optimisation are essentially meaningless, as compilers are at liberty to generate extra code to help debugging and thus generate much slower code. – Richard Critten Aug 21 '20 at 19:50
    I'd expect the `strlen` to chew up your time, but a smart compiler might go, "meh. Constant string." and then replace `strlen` with a constant. – user4581301 Aug 21 '20 at 19:56
  • I know where the next comment is going. – user4581301 Aug 21 '20 at 19:57
  • `So rationally, fewer instructions should mean more speed` This is a decent approximation. But it isn't something that you can assume hold in every case. Some instructions can be far more expensive than others. – eerorika Aug 21 '20 at 19:57
  • *"So rationally, fewer instructions should mean more speed..."*: This has been far too naive for more than a decade now. A single memory cache miss requiring access to DRAM is on the order of 200 cycles (current PC or laptop). An incorrectly predicted branch is super costly. Certain dependencies between consecutive instructions are costly, etc. – Codo Aug 21 '20 at 20:09
  • Have you tried swapping the order around, and calling `TEST_RECURSIVE(stringUpperStrlen);` first? – 1201ProgramAlarm Aug 21 '20 at 20:10
  • @user4581301 The method uses the class's own char* copy, not the const char* literal ("Hello World!"), does this make any difference? – Enigma Aug 21 '20 at 20:11
  • @1201ProgramAlarm I've tried this before, but the result is the same. – Enigma Aug 21 '20 at 20:12
  • Performance profiling debug code is irrelevant. – Eljay Aug 21 '20 at 20:12
  • *So rationally, fewer instructions should mean more speed* -- This is not true, and not even in the assembly phase. A compiler may optimize code for speed, and the resulting executable is *larger* in size than a non-optimized version. – PaulMcKenzie Aug 21 '20 at 20:16
  • @PaulMcKenzie Understandable, but I can't figure out where the difference between upperStrlen and upperPtr. – Enigma Aug 21 '20 at 20:23
  • `"Hello World!"` is constant. An optimizing compiler may be able to make use of this knowledge all the way through the program, simplifying and sometime even eliminating whole swaths of code. Yesterday or the day before we had a asker with code that, once the compiler was done, resolved down to a `printf` displaying a constant. All of the loops, math, pointer-play was all resolved at compile time. – user4581301 Aug 21 '20 at 20:53
  • @Enigma -- In general, when you have C++ code (or C code) that uses too many pointers (as your implementation does), the code becomes slower than if you hadn't used pointers. The reason is that overly using pointers in the code renders the compiler's optimizations useless due to aliasing issues that the optimizer cannot resolve easily. This is why trying to beat the compiler at the optimization game by using pointers usually doesn't work. It may have worked decades ago, but not in this day and age of optimizing compilers. – PaulMcKenzie Aug 21 '20 at 20:58
  • You might want to test with a string that takes up more than one cache line. – Davis Herring Aug 22 '20 at 01:06

1 Answer


You have several misconceptions about performance that need to be dispelled.

Now I can test the speed of those methods in my main.cpp: (…)

Your benchmarking code calls the benchmarked functions directly. So you're measuring the benchmarked functions as optimized for the specific case of how they're used by the benchmarking code: to call them repeatedly on the same input. This is unlikely to have any relevance to how they behave in a realistic environment.

I think the compiler didn't do anything earth-shattering because it doesn't know what toupper does. If the compiler had known that toupper doesn't transform a nonzero character into zero, it might well have hoisted the strlen call outside the benchmarked loop. And if it had known that toupper(toupper(x)) == toupper(x), it might well have decided to run the loop only once.

To make a somewhat realistic benchmark, put the benchmarked code and the benchmarking code in separate source files, compile them separately, and disable any kind of cross-module or link-time optimization.
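
If you keep everything in one translation unit, a common mitigation is an optimization barrier, as used by benchmarking harnesses such as Google Benchmark. Here is a sketch of the GCC/Clang trick (the name doNotOptimize is mine, not part of any standard API):

template <typename T>
inline void doNotOptimize(T& value)
{
    // An empty asm statement that claims to read and write `value` and
    // to clobber memory: the compiler must keep `value` computed and
    // cannot assume anything about its contents afterwards.
    asm volatile("" : "+m,r"(value) : : "memory");
}

Calling doNotOptimize(a) after each a.upperStrlen() in the benchmark loop would keep the compiler from specializing the function for the repeated-call pattern.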

Then I can compile and test with cmake in Debug or Release

Compiling in debug mode rarely has any relevance to microbenchmarks (benchmarking the speed of an implementation of a small fragment of code, as opposed to benchmarking the relative speed of algorithms in terms of how many elementary functions they call). Compiler optimizations have a significant effect on microbenchmarks.

So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc.).

No, absolutely not.

First of all, fewer instructions total is completely irrelevant to the speed of the program. Even on a platform where executing one instruction takes the same amount of time regardless of what the instruction is, which is unusual, what matters is how many instructions are executed, not how many instructions there are in the program. For example, a loop with 100 instructions that is executed 10 times is 10 times faster than a loop with 10 instructions that is executed 1000 times, even though it's 10 times larger. Inlining is a common program transformation that usually makes the code larger and makes it faster often enough that it's considered a common optimization.
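
To make that concrete with a sketch (not code from the question): a manually 4x-unrolled variant of the uppercasing loop has a body roughly four times larger, yet its loop control executes only about a quarter as often:

#include <ctype.h>
#include <stddef.h>

void upperUnrolled(char* s, size_t n)
{
    size_t i = 0;
    // Larger code: four toupper calls per iteration, but the loop
    // control (compare, add, branch) executes only about n/4 times.
    for (; i + 4 <= n; i += 4) {
        s[i]     = toupper((unsigned char)s[i]);
        s[i + 1] = toupper((unsigned char)s[i + 1]);
        s[i + 2] = toupper((unsigned char)s[i + 2]);
        s[i + 3] = toupper((unsigned char)s[i + 3]);
    }
    // Handle the 0-3 leftover characters.
    for (; i < n; i++)
        s[i] = toupper((unsigned char)s[i]);
}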

Second, on many platforms, such as any PC or server made in the 21st century, any smartphone, and even many lower-end devices, the time it takes to execute an instruction can vary so widely that it's a poor indication of performance. Cache is a major factor: a read from memory can be more than 1000 times slower than a read from cache on a PC. Other factors with less impact include pipelining, which causes the speed of an instruction to depend on the surrounding instructions, and branch prediction, which causes the speed of a conditional instruction to depend on the outcome of previous conditional instructions.
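
The branch prediction effect is easy to demonstrate. The function below executes exactly the same instructions either way, but it typically runs several times faster on a sorted array, where the conditional is predictable, than on a shuffled one (an illustrative sketch; this assumes the compiler keeps the branch rather than if-converting it to a conditional move, and exact ratios vary by processor):

#include <vector>

long sumAboveThreshold(const std::vector<int>& data)
{
    long sum = 0;
    for (int v : data) {
        // The cost of this branch depends on how predictable it is,
        // not just on the instructions it compiles to.
        if (v >= 128)
            sum += v;
    }
    return sum;
}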

Third, that's just considering processor instructions — what you see in assembly code. Compilers for C, C++ and most other languages optimize programs in such a way that it can be hard to predict what the processor will be doing exactly.

For example, how long does the instruction ++x; take on a PC?

  • If the compiler has figured out that the addition is unnecessary, for example because nothing uses x afterwards, or because the value of x is known at compile time and therefore so is the value of x+1, it'll optimize it away. So the answer is 0 (see the sketch after this list).
  • If the value of x is already in a register at this point and the value is only needed in a register afterwards, the compiler just needs to generate an addition or increment instruction. So the simplistic, but not quite correct answer is 1 clock cycle. One reason this is not quite correct is that merely decoding the instruction takes many cycles on a high-end processor such as what you find in a 21st century PC or smartphone. However “one cycle” is kind of correct in that while it takes multiple clock cycles from starting the instruction to finishing it, the instruction only takes one cycle in each pipeline stage. Furthermore, even taking this into account, another reason this is not quite correct is that ++x; ++y; might not take 2 clock cycles: modern processors are sophisticated enough that they may be able to decode and execute multiple instructions in parallel (for example, a processor with 4 arithmetic units can perform 4 additions at the same time). Yet another reason this might not be correct is if the type of x is larger or smaller than a register, which might require more than one assembly instruction to perform the addition.
  • If the value of x needs to be loaded from memory, this takes a lot more than one clock cycle. Anything other than the innermost cache level dwarfs the time it takes to decode the instruction and perform the addition. The amount of time is very different depending on whether x is found in the L3 cache, in the L2 cache, in the L1 cache, or in the “real” RAM. And even that gets more complicated when you consider that x might be part of a cache prefetch (hardware- or software-triggered).
  • It's even possible that x is currently in swap, so that reading it requires reading from a disk.
  • And writing the result exhibits somewhat similar variations to reading the input. However the performance characteristics are different for reads and for writes because when you need a value, you need to wait for the read to be complete, whereas when you write a value, you don't need to wait for the write to be complete: a write to memory writes to a buffer in cache, and the time when the buffer is flushed to a higher-level cache or to RAM depends on what else is happening on the system (what else is competing for space in the cache).
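
To make the first point concrete, here is a sketch of a case where ++x; costs nothing (GCC and Clang at -O2 compile this function down to a single constant load, though exact output varies by version):

int incremented()
{
    int x = 41;
    ++x;        // nothing observable depends on the intermediate value,
                // and x is known at compile time, so this folds away
    return x;   // gcc -O2 emits just: mov eax, 42 ; ret
}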

Ok, now let's turn to your specific example and look at what happens in the inner loop of each function. I'm not very familiar with x86 assembly, but I think I get the gist.

For stringUpperStrlen, the inner loop starts at .L4. Just before entering the inner loop, %rbx is set to the length of the string. Here's what the inner loop contains:

  • cmpq %rbp, %rbx: Compare the current index to the length, both obtained from registers.
  • je .L3: conditional jump, to exit the loop if the index is equal to the length.
  • movq 0(%r13), %r12: Read from memory to get the address of the beginning of the string. (I'm surprised that the address isn't in a register at this point.)
  • addq %rbp, %r12: an arithmetic operation that depends on the value that was just read from memory.
  • movsbl (%r12), %edi: Read the current character from the string in memory.
  • incq %rbp: Increment the index. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
  • call toupper@PLT
  • movb %al, (%r12): Write the value returned by the function to the current character of the string in memory.
  • jmp .L4: Unconditional jump to the beginning of the loop.

For stringUpperPtr, the inner loop starts at .L9. Here's what the inner loop contains:

  • movsbl (%rbx), %edi: Read the current character from the string in memory.
  • testb %dil, %dil: test if %dil is zero. %dil is the least significant byte of %edi which was just read from memory.
  • je .L8: conditional jump, to exit the loop if the character is zero.
  • call toupper@PLT
  • movb %al, (%rbx): Write the value returned by the function to the current character of the string in memory.
  • incq %rbx: Increment the pointer. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
  • jmp .L9: Unconditional jump to the beginning of the loop.

The differences between the two loops are:

  • The loops have slightly different lengths, but both are small enough that they fit in a single cache line (or two, if the code happens to straddle a line boundary). So after the first iteration of the loop, the code will be in the innermost instruction cache. Not only that, but if I understand correctly, on modern Intel processors, there is a cache of decoded instructions, which the loop is small enough to fit in, and so no decoding needs to take place.
  • The stringUpperStrlen loop has one more read. The extra read is from a constant address which is likely to remain in the innermost cache after the first iteration.
  • The conditional instruction in the stringUpperStrlen loop depends only on values that are in registers. On the other hand, the conditional instruction in the stringUpperPtr loop depends on a value which was just read from memory.

So the difference boils down to an extra data read from the innermost cache, vs having a conditional instruction whose outcome depends on a memory read. An instruction whose outcome depends on the result of another instruction leads to a hazard: the second instruction is blocked until the first instruction is fully executed, which prevents taking advantage of pipelining, and can render speculative execution less effective. In the stringUpperStrlen loop, the processor essentially runs two things in parallel: the load-call-store cycle, which doesn't have any conditional instructions (apart from what happens inside toupper), and the increment-test cycle, which doesn't access memory. This lets the processor work on the conditional instruction while it's waiting for memory. In the stringUpperPtr loop, the conditional instruction depends on a memory read, so the processor can't start working on it until the read is complete. I'd typically expect this to be slower than the extra read from the innermost cache, although it might depend on the processor.

Of course, the stringUpperStrlen loop does need a load-test hazard to determine the end of the string: no matter how it does it, it needs to fetch characters from memory. This is hidden inside repnz scasb. I don't know the internal architecture of an x86 processor, but I suspect that this case (which is extremely common since it's the meat of strlen) is heavily optimized inside the processor, probably to an extent that is impossible to reach with generic instructions.
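
For reference, here is a sketch of the classic generic word-at-a-time technique that library strlen implementations build on (glibc's real code uses SIMD and is considerably more careful; fastStrlen is an illustrative name, not a real API):

#include <stdint.h>
#include <string.h>
#include <stddef.h>

size_t fastStrlen(const char* s)
{
    const char* p = s;
    // Advance byte by byte until p is 8-byte aligned, so that the wide
    // loads below stay inside one aligned granule and cannot fault.
    while ((uintptr_t)p % 8 != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    for (;;) {
        uint64_t w;
        memcpy(&w, p, 8);  // one 8-byte load instead of eight 1-byte loads
        // Classic bit trick: nonzero iff some byte of w is zero.
        if ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) {
            while (*p != '\0')  // locate the exact terminator
                p++;
            return (size_t)(p - s);
        }
        p += 8;
    }
}

Note that the wide load may read a few bytes past the terminator; staying within an aligned 8-byte block means it cannot cross a page boundary, but strictly speaking reading past the end of the object is a liberty reserved for the library's own implementation.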

You may see different results if the string were longer and the two memory accesses in stringUpperStrlen weren't in the same cache line, although possibly not, because this only costs one more cache line and there are several. The details would depend on how the caches work and how toupper uses them.

  • Actually `repnz scasb` is garbage for performance with long strings, and not good with short strings. Only unconditional `rep movs` and `rep stos` (memcpy / memset respectively) have optimized microcode that goes 16 or 32 bytes at a time; scasb and cmpsb only go 1 count at a time, in this case 1 byte at a time. GCC probably shouldn't be using `repnz scasb`; calling libc `strlen` might actually be faster. It can check alignment and then use SIMD `pcmpeqb` / `pmovmskb` / `bsf`. – Peter Cordes Aug 24 '20 at 11:46
  • Perhaps @Enigma was compiling with `-O1` instead of full optimization (`-O3`) for the "release" tests? That would match GCC's poor choices for expanding builtin strlen: [Why is this code 6.5x slower with optimizations enabled?](https://stackoverflow.com/q/55563598) has details, see my answer. But we can see from the listed build commands that the source was built with `-O3` (and `-fno-strict-aliasing`??). But then linked with `-Os`. If there was link-time optimization / code-gen with `-flto` then `-Os` would explain it, but `-O3` shouldn't be using `repnz scasb`. – Peter Cordes Aug 24 '20 at 11:52
  • You already make this point, but it's a good idea to emphasize that *how long does the instruction ++x; take on a PC?* truly doesn't have an answer even if you know the context. Superscalar out-of-order execution means that the performance of a single instruction is not a single number that you can add up. There are 3 dimensions: latency, front-end uops, and back-end ports. [Those factors add up separately](https://stackoverflow.com/q/51607391), and whichever one is the biggest bottleneck for a loop forms the critical path. (Or for short loops, you have to consider their surrounding code) – Peter Cordes Aug 24 '20 at 11:58
  • @PeterCordes Thank you for the extra information about x86. Regarding how long an instruction takes, yet another factor is hyperthreading: the time an instruction takes can depend on whether the other thread is currently using some shared resource. – Gilles 'SO- stop being evil' Aug 24 '20 at 12:10
  • Yeah the more I look at this asm, the more it looks ilke `-O1` or `-Os` output, not the `-O3` the question implies. GCC doesn't suck, it's not going to put an unconditional `jmp .L3` at the bottom of the loop when it could rearrange things to only have a conditional branch at the bottom. And it's not going to expand strlen to a potentially disastrously slow `repnz scasb` at `-O3`. Also `-O3` will inline `toupper` and hoist its table setup. But I can reproduce those code-gen choices with `-Os` on Godbolt: https://godbolt.org/z/3bbnn6 – Peter Cordes Aug 24 '20 at 12:17
  • *the second instruction is blocked until the first instruction is fully executed,* - only for data dependencies, not for control dependencies. Branch prediction + speculative execution *do* execute past conditional branch instructions in OoO exec CPUs. I think what's going on here is that the strings are short enough that [Avoid stalling pipeline by calculating conditional early](https://stackoverflow.com/q/49932119) is relevant: slow but branchless `repnz scasb` can't mispredict, and then OoO exec can run ahead and predict the loop-exit branch being taken before other exec "gets there". – Peter Cordes Aug 24 '20 at 12:26
  • So re: `-O3` vs. `-Os`, probably @Enigma benchmarked with `-O3` using the build commands shown, but the asm source in the question was generated with different build options, namely `-Os`. (It's clearly `-S` compiler output, not disassembly, which would be fine if it was using the same options the binary was compiled with (not linked with)). – Peter Cordes Aug 24 '20 at 12:41
  • @PeterCordes Yes, I was using `-O3` in Release mode and `-Os` + `-S` for the assembly because I was letting CMake compile the code; now I changed it. However, I discovered the `-foptimize-strlen` option, which explains the performance difference. I think optimizations are magic tricks and I'm trying to figure out when the code is optimized and when not LOL – Enigma Aug 25 '20 at 15:37
  • @Enigma: CPUs can't run C directly, only machine code / asm, so compilers [always have to make choices](https://stackoverflow.com/questions/33278757/disable-all-optimization-options-in-gcc/33284629#33284629) when transforming the source into something the CPU can run. Benchmarking with `-O0` / unoptimized code wouldn't be useful here; less optimization is only useful for debugging your program, or debugging the compiler. Comparing `-Os` vs `-O3` vs. `-O2` can sometimes be interesting, especially for GCC where `-O2` doesn't enable auto-vectorization. – Peter Cordes Aug 25 '20 at 20:16