
I'm trying to understand how to measure performance and decided to write this very simple program:

section .text
    global _start

_start:
    mov rax, 60     ; 60 = __NR_exit on x86-64 Linux
    syscall         ; sys_exit(rdi); rdi is left uninitialized here

And I ran the program with `perf stat ./bin`. The thing I was surprised by is that stalled-cycles-frontend was so high:

      0.038132      task-clock (msec)         #    0.148 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
             2      page-faults               #    0.052 M/sec                  
       107,386      cycles                    #    2.816 GHz                    
        81,229      stalled-cycles-frontend   #   75.64% frontend cycles idle   
        47,654      instructions              #    0.44  insn per cycle         
                                              #    1.70  stalled cycles per insn
         8,601      branches                  #  225.559 M/sec                  
           929      branch-misses             #   10.80% of all branches        

   0.000256994 seconds time elapsed

As I understand it, stalled-cycles-frontend means that the CPU front-end has to wait for the result of some operation (e.g. a bus transaction) to complete.

So what caused the CPU front-end to wait for most of the time in such a simple case?

And 2 page faults? Why? I don't read any memory pages.

  • How come you are not worried by the 47,654 instructions? :) Not sure what exactly is counted by perf, but presumably whatever else it was executing (kernel code?) is responsible for the stalls too. – Jester Feb 15 '18 at 14:28
  • It might be interesting to `perf record` this program, and see if you can find where these counts happen. (IDK if you can set perf with lower thresholds so the counters will roll over and trigger an interrupt or a PEBS logging event often enough to get a count for anything, though.) – Peter Cordes Feb 15 '18 at 14:32
  • Your `mov` instruction is not the first instruction that's executed by your process. I think you might want to take a look at this article: https://blogs.oracle.com/ksplice/hello-from-a-libc-free-world-part-1 This gives an overview on all the things that are happening before your own assembly code is executed, and how you can stop this stuff from happening. Quite an interesting read, actually. – cmaster - reinstate monica Feb 15 '18 at 14:39
  • @cmaster I thought it was first. I ran `objdump -s` on the binary and got this: `Contents of section .text: 400080 b83c0000 000f05`. At least the assembler did not insert anything nasty. – St.Antario Feb 15 '18 at 14:41
  • @cmaster what makes you think the OP isn't building a statically linked binary with those instructions at the ELF entry point? I think instruction counts would be higher if that was a dynamically linked binary (especially if it linked glibc so glibc's startup functions run), or a `main` called from CRT startup code. So yes, if you assemble those two instructions with nasm and link with ld, the only two instructions that run in user-space are `mov` and `syscall`. (But this would be a better question if it included the build commands so we'd know this.) – Peter Cordes Feb 15 '18 at 14:43
  • Did you really write `section. text`, instead of `section .text` (and if so, what assembler accepted that)? BTW, it would make the question clearer if you included the `_start:` label, even though `ld` defaults to using the start of the `.text` section as the entry point if there's no `_start` symbol. – Peter Cordes Feb 15 '18 at 14:47
  • @PeterCordes Just a typo. Fixed. – St.Antario Feb 15 '18 at 14:49
  • It couldn't have existed in the first place if you'd copy/pasted your real code (and build instructions). Always do that instead of retyping code. Only type code into SO if you're making it up and *haven't* tested it anywhere. – Peter Cordes Feb 15 '18 at 14:52
  • @PeterCordes It was precisely the missing `_start:` label that led me to think that the normal initialization is performed. Also the lack of a description in the question of how the libc linking was suppressed: It's pretty hard to generate an executable without this stuff if you are not explicitly trying to do so. Anyway, independent of whether my guess was correct or not, I'll not delete my comment: The link is really good, and likely interesting to other readers as well. – cmaster - reinstate monica Feb 15 '18 at 15:11
  • @cmaster _It's pretty hard to generate an executable without this stuff if you are not explicitly trying to do so_ Do you mean the libc is statically linked to all binaries by default? – St.Antario Feb 15 '18 at 15:13
  • @St.Antario At least if you link with a C compiler, yes. On Linux, the libc is linked *dynamically*, though, because the libc depends on the kernel headers. I'm not sure how a naked linker invocation behaves, but the general gist is that the linker first inserts some code to interpret the data structures that are passed in from the kernel (command line arguments and environment variables), then proceeds to initialize the libc, then loads all other dynamic libraries, and only then enters the program's `main()` function. Removing all these parts is not trivial (see the link I gave). – cmaster - reinstate monica Feb 15 '18 at 16:37
  • There's a similar question: https://stackoverflow.com/q/47724328/620382. Please add more information on how you build the binary to get a more specific answer. The specific kernel version would also help. – Zulan Feb 15 '18 at 17:03
  • @cmaster: `ld foo.o -o foo` makes a static executable on Linux. And no, the reason for dynamically linking is not a dependency on kernel headers (the user-space ABI is stable, and you don't need a different glibc to match your kernel). The reason is so libc isn't embedded into every binary, and glibc bugfixes don't require recompiling *everything*. Plus the usual memory advantages. With gcc, to make a static executable with a custom `_start`, use `gcc foo.o -nostdlib -static`. Or to dynamically link libc with your own `_start`, `gcc foo.o -nostartfiles` (glibc init funcs do still run). – Peter Cordes Feb 15 '18 at 17:13
  • @cmaster: glibc init functions run using the same mechanism that runs constructors in dynamically-loaded C++ libraries: `libc.so.6` has the addresses of the necessary init functions in a `_global_ctors` array or something like that, which the dynamic linker invokes before jumping to `_start`. The linker only "inserts code to ... call `main`" (i.e. the CRT (C RunTime) start files) if gcc tells it to. If you run `gcc -v foo.c`, you'll see the compiler, assembler, and linker command lines (`ld` invoked via the `collect2` wrapper, but you can see the `crt0.o` passed explicitly by `gcc`). – Peter Cordes Feb 15 '18 at 17:17
  • TL:DR: it is pretty trivial to make static executables on Linux, using `gcc -nostdlib` or using `ld` directly. See [a detailed example with full source + commands](https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all/44193770#44193770) showing the effect of `mov`-elimination on Intel Haswell/Skylake using `perf`. Fun fact: you can even write a `_start` in C and use inline asm to invoke `sys_exit` if you know exactly what you're doing and what you can and can't do from that context :P (I've done that once or twice for a microbenchmark). – Peter Cordes Feb 15 '18 at 17:21
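
For context, here is a minimal sketch of the kind of build the comments above describe: assembling the two instructions with NASM and statically linking with `ld` (or `gcc -nostdlib -static`), so that `_start` is the ELF entry point and only `mov` and `syscall` run in user space. The file name `exit.asm` and output name `bin` are assumptions; the question doesn't show the actual commands used.

; exit.asm -- assumed build/run commands (not from the question):
;   nasm -felf64 exit.asm -o exit.o
;   ld exit.o -o bin                     ; static binary, _start = ELF entry point
;   (or: gcc exit.o -nostdlib -static -o bin)
;   perf stat ./bin

section .text
    global _start

_start:
    mov rax, 60      ; __NR_exit on x86-64 Linux
    xor edi, edi     ; exit status 0 (the original leaves rdi uninitialized)
    syscall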

1 Answer


Page faults include code pages: executing the first instruction from a not-yet-mapped page of your executable triggers a (minor) page fault, even though you never read any data.

perf stat includes startup overhead.

IDK the details of how perf starts counting, but presumably it has to program the performance counters in kernel mode, so they're already counting while the CPU switches back to user mode (stalling for many cycles, especially on a kernel with Meltdown mitigations, which invalidate the TLBs).

I guess most of the 47,654 instructions that were recorded were kernel code. Perhaps including the page-fault handler!

I guess your process never goes user->kernel->user; the whole lifetime is kernel->user->kernel (startup, then the syscall that invokes sys_exit, which never returns to user-space), so there's never a case where the TLBs would have been hot anyway, except maybe while running inside the kernel after the sys_exit system call. And anyway, TLB misses aren't page faults, but they would explain lots of stalled cycles.

The user->kernel transition itself accounts for about 150 stalled cycles, BTW. `syscall` is faster than a cache miss, except that it's not pipelined; in fact it flushes the whole pipeline (i.e. the privilege level is not renamed).

  • This is a quick partial answer; I'd like to see an answer which explains exactly which (presumably kernel) instructions are included as part of what `perf stat` counts as startup / shutdown overhead. And also if you can control that; I think x86 perf counters can be programmed to only count in user mode, or kernel + user mode. – Peter Cordes Feb 15 '18 at 14:45
  • As you suggested, I ran `perf record -e dTLB-load-misses`. It counted `10519` events. At the top was the following line: `35.55% perf [kernel.kallsyms]`. Does that mean the TLB-miss overhead from perf's own counting is significant in such a simple case? – St.Antario Feb 15 '18 at 16:01
  • @St.Antario: apparently yes. You got 10k TLB misses for 47k instructions, though? That sounds implausibly high. Sounds like we need a real answer to this question, instead of just my hand-waving about `perf` including some startup overhead. – Peter Cordes Feb 15 '18 at 17:24
  • But if the `perf` startup overhead is too high... how can we write micro/nano-benchmarks using it? Is it even possible? – St.Antario Feb 16 '18 at 08:51
  • @St.Antario: put your code in a loop that runs for at least 50 milliseconds or something. I have a `testloop.asm` and an assemble + link + `perf stat` command line I can find easily that uses `mov ebp, 1000000000` as the loop counter for a `dec ebp / jnz` loop. I can add or remove zeros as needed if the code in the loop is more than a few fast uops. With `perf stat -r4`, I get extremely good repeatability and accuracy, with the expected results to within 1 part in 10k or 100k for run times of 0.1 to 1 second or so. See https://stackoverflow.com/questions/44169342/. – Peter Cordes Feb 16 '18 at 09:04
  • @St.Antario: An alternative for timing transitory effects (like the first run of a function while code cache is cold and branch prediction isn't trained) is to disable turbo and stuff, and use `rdtsc` to measure time. Or instead of using `perf` (which virtualizes the perf counters), use a lower-level library to access them directly. See [@ BeeOnRope's uarch-bench project](https://github.com/travisdowns/uarch-bench). With a kernel module to let you program the counters, you can then read them from userspace with `rdpmc` (if you enable permissions for that). – Peter Cordes Feb 16 '18 at 09:19
  • Or use Intel-PT to record a timestamp on each taken branch, so you can see whether there's variation in how long each loop iteration takes. But like `rdtsc`, that only gives you time, not any other interesting counters. – Peter Cordes Feb 16 '18 at 09:21
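
Here is a minimal sketch of the test-loop approach described in the comments above. The file name `testloop.asm` and the exact build commands are assumptions (the same NASM + `ld` build as before), not Peter Cordes' actual file. The instructions under test go inside the loop; the `dec`/`jnz` pair adds roughly one cycle of loop overhead per iteration on recent Intel CPUs, since the two instructions macro-fuse.

; testloop.asm -- assumed build/run commands:
;   nasm -felf64 testloop.asm -o testloop.o && ld testloop.o -o testloop
;   perf stat -r4 ./testloop                               ; -r4 = run 4 times, report mean +- stddev
;   perf stat -r4 -e cycles:u,instructions:u ./testloop    ; :u = count user-space only

section .text
    global _start

_start:
    mov ebp, 1000000000      ; iteration count; add/remove zeros to adjust run time
.loop:
    ; put the instruction(s) under test here
    dec ebp
    jnz .loop                ; dec/jnz macro-fuse: ~1 cycle/iteration of loop overhead

    mov rax, 60              ; sys_exit
    xor edi, edi
    syscall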