rdpmc: surprising behavior

Question

I'm trying to understand the rdpmc instruction. As such I have the following asm code:

segment .text
global _start

_start:
    xor eax, eax
    mov ebx, 10
.loop:
    dec ebx
    jnz .loop

    mov ecx, 1<<30
    ; calling rdpmc with ecx = (1<<30) gives number of retired instructions
    rdpmc
    ; but only if you do a bizarre incantation: (Why u do dis Intel?)
    shl rdx, 32
    or  rax, rdx

    mov rdi, rax ; return number of instructions retired.
    mov eax, 60
    syscall

(The implementation is a translation of rdpmc_instructions().) I count that this code should execute 2*ebx+3 instructions before hitting the rdpmc instruction, so I expect (in this case) that I should get a return status of 23.

If I run perf stat -e instructions:u ./a.out on this binary, perf tells me that I've executed 30 instructions, which looks about right. But if I execute the binary directly, I get a return status of 58 or 0; the result is not deterministic.

What have I done wrong here?

You can't get a return status of 306 because only least significant 8 bits of the exit value are returned to the parent process. — Ross Ridge, May 17 '19 at 20:03
Have you tried counting a delta between entry to `_start` vs. at the end? Have you tried increasing the iteration count to see if the result varies with instructions executed *at all*? — Peter Cordes, May 17 '19 at 20:04
code review: a better translation of `for(i=0 ; i<1000; i++)` is a mov-immediate to register with the loop counter. Or `cmp eax, 1000`. Using `a dq 100` is just clutter; inline small read-only constants. (Use `equ` if you still want the definition ahead of code). The correct translation of `1<<30` is `mov ecx, 1<<30`, not a runtime shift. A more efficient loop structure is `dec ebx / jnz .loop`. `rdpmc` writes EAX and EDX, implicitly zero-extending into RAX and RDX, you don't need to zero them first. Also, you might as well ignore RDX unless it's possible for the count to be > 2^32. — Peter Cordes, May 17 '19 at 20:10
Also don't forget to use `default rel` so `[a]` uses a RIP-relative addressing mode. (Unless you're trying to experiment with the difference between rel and abs addressing modes). — Peter Cordes, May 17 '19 at 20:12
Also, if you don't do anything special to reset the performance counter before your program runs, there's no reason to expect it to start counting from zero. That's the point of using `perf`. But you could take a delta. — Peter Cordes, May 17 '19 at 20:23
@RossRidge: Edited to make sure that the number of instruction is less than 256. — user14717, May 17 '19 at 20:24
You forgot to update your text, so now it doesn't match the code. Like I just commented, try taking a delta because the counter probably starts at some arbitrary 64-bit value. — Peter Cordes, May 17 '19 at 20:25
@PeterCordes: Thanks! It's hard to find people who have good taste in assembly, so this was very helpful. I've tried commenting out the loop; I sometimes get 58, sometimes 0. The result is not deterministic. — user14717, May 17 '19 at 20:25
@PeterCordes: Tried taking the delta, now I'm getting zero identically every time. — user14717, May 17 '19 at 20:28
Then probably the counter isn't enabled. Try running your program under `perf`, so it's profiling itself as well as being profiled by `perf`. That should get `perf` to have the fixed counters enabled. — Peter Cordes, May 17 '19 at 20:37
@PeterCordes: When I run it under perf, I get 27 instructions, deterministically, which is about right. — user14717, May 17 '19 at 20:44
Cool, that confirms my guess :) The fixed counters are only counting when enabled. — Peter Cordes, May 17 '19 at 20:47
@PeterCordes: So ostensibly there's some CPU flag that needs to be set to get the counters to operate? — user14717, May 17 '19 at 21:15
@user14717 The IA32_PERF_GLOBAL_CTRL and IA32_FIXED_CTR_CTRL MSRs have to be modified (see Chapter 18 in Volume 3 of Intel's "Software Developer’s Manual"). — Andreas Abel, May 17 '19 at 22:52
@AndreasAbel: Could you edit Peter's answer to set the correct bits in that register so we can have an authoritative answer to this question? I think it is of general interest. — user14717, May 17 '19 at 23:55

score 5 · Accepted Answer · edited May 18 '19 at 01:47

The fixed counters don't count all the time, only when software has enabled them. Normally (the kernel side of) perf does this, along with resetting them to zero before starting a program.

The fixed counters (like the programmable counters) have bits that control whether they count in user, kernel, or user+kernel (i.e. always). I assume Linux's perf kernel code leaves them set to count neither when nothing is using them.

If you want to use raw RDPMC yourself, you need to either program / enable the counters (by setting the corresponding bits in the IA32_PERF_GLOBAL_CTRL and IA32_FIXED_CTR_CTRL MSRs), or get perf to do it for you by still running your program under perf. e.g. perf stat ./a.out

If you use perf stat -e instructions:u ./a.out ; echo $?, the fixed counter will actually be zeroed before entering your code, so you get consistent results from using rdpmc once. Otherwise, e.g. with the default -e instructions (not :u), you don't know the initial value of the counter. You can fix that by taking a delta: read the counter once at the start, then once after your loop.

The exit status is only 8 bits wide, so this little hack to avoid printf or write() only works for very small counts.

It also means it's pointless to construct the full 64-bit rdpmc result: the high 32 bits of the inputs don't affect the low 8 bits of a sub result because carry propagates only from low to high. In general, unless you expect counts > 2^32, just use the EAX result. Even if the raw 64-bit counter wrapped around during the interval you measured, your subtraction result will still be a correct small integer in a 32-bit register.
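To illustrate why the low 32 bits are enough, here is a minimal sketch in bash arithmetic. The counter readings are made-up values chosen so that the raw counter crosses 2^32 between the two reads, and `delta_low32` is a hypothetical helper, not part of any tool.

```bash
#!/usr/bin/env bash
# Subtract only the low 32 bits of two counter readings, as `sub eax, edi` does.
delta_low32() {
    local start=$1 end=$2
    echo $(( ((end & 0xFFFFFFFF) - (start & 0xFFFFFFFF)) & 0xFFFFFFFF ))
}

# Made-up readings: the counter passes 2^32 between the two samples.
start=$(( 0xFFFFFFF0 ))
end=$(( 0x100000007 ))

echo $(( end - start ))        # full 64-bit difference
delta_low32 "$start" "$end"    # same difference, computed from the low halves only
```

Both lines print the same small number, because the borrow out of the low half only ever affects the bits above it.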

Here's a version simplified even more than the one in your question. Also note the indented operands, so they can stay at a consistent column even for mnemonics longer than 3 letters.

segment .text
global _start

_start:
    mov   ecx, 1<<30      ; fixed counter: instructions
    rdpmc
    mov   edi, eax        ; start

    mov   edx, 10
.loop:
    dec   edx
    jnz   .loop

    rdpmc                 ; ecx = same counter as before

    sub   eax, edi        ; end - start

    mov   edi, eax        ; exit status = low 8 bits of the delta
    mov   eax, 231        ; __NR_exit_group
    syscall               ; sys_exit_group(rdpmc).  sys_exit isn't wrong, but glibc uses exit_group.

Running this under perf stat ./a.out or perf stat -e instructions:u ./a.out, we always get 23 from echo $?. (instructions:u shows 30, which is 1 more than the actual number of instructions this program runs, including syscall.)

23 instructions is exactly the number of instructions strictly after the first rdpmc, but including the 2nd rdpmc.

If we comment out the first rdpmc and run it under perf stat -e instructions:u, we consistently get 26 as the exit status, and 29 from perf. rdpmc is the 24th instruction to be executed. (And RAX starts out initialized to zero because this is a Linux static executable, so the dynamic linker didn't run before _start). I wonder if the sysret in the kernel gets counted as a "user" instruction.

But with the first rdpmc commented out, running under perf stat -e instructions (not :u) gives arbitrary values as the starting value of the counter isn't fixed. So we're just taking (some arbitrary starting point + 26) mod 256 as the exit status.
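The mod-256 effect can be reproduced from a shell without touching any counters. This is a minimal sketch assuming bash on Linux. The starting value 1000000 is made up, and `exit_status_of` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Report the exit status a parent sees when a child exits with the given value.
exit_status_of() {
    bash -c "exit $1"
    echo $?
}

arbitrary_start=1000000                        # made-up initial counter value
exit_status_of $(( arbitrary_start + 26 ))     # only (start + 26) mod 256 survives
exit_status_of 306                             # the value from the comments: 306 mod 256
```

The parent never sees more than the low 8 bits, so two runs that start from different counter values produce unrelated exit statuses.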

But note that RDPMC is not a serializing instruction, and can execute out of order. In general you may need lfence, or (as John McCalpin suggests in the thread you linked) giving ECX a false dependency on the results of instructions you care about. e.g. and ecx, 0 / or ecx, 1<<30 works, because unlike xor-zeroing, and ecx,0 is not dependency-breaking.

Nothing weird happens in this program because the front-end is the only bottleneck, so all the instructions execute basically as soon as they're issued. Also, the rdpmc is right after the loop, so probably a branch mispredict of the loop-exit branch prevents it from being issued into the OoO back-end before the loop finishes.

PS for future readers: one way to enable user-space RDPMC on Linux without any custom modules beyond what perf requires is documented in perf_event_open(2):

echo 2 | sudo tee /sys/devices/cpu/rdpmc    # enable RDPMC always, not just when a perf event is open

This instruction is somewhat strange in that it doesn't segfault when counters aren't enabled, it just. . . does the wrong thing. Also, I can't find anything in the Intel manual saying what needs to be done to get counters to run. — user14717, May 17 '19 at 22:00
Note that `rdpmc` is not a serializing instruction. To get reliable results, it has to be sandwiched between serializing instructions such as `lfence`. — Andreas Abel, May 17 '19 at 22:10
@AndreasAbel ah good point. This program doesn't include any bottlenecks other than the front-end, so instructions are all going to execute as quickly as their uops enter the out-of-order back end. And the branch miss on loop exit probably helps. One of John McCalpin's posts on [the thread the OP linked](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214) includes the idea of giving ECX a false dependency on the result of code you want to measure. (e.g. `and ecx,0` (not dep-breaking) / `or ecx, 1<<30`). — Peter Cordes, May 17 '19 at 22:21
@PeterCordes But this wouldn't prevent later instructions (in this example, e.g., `mov eax, 60`) from potentially being executed before `rdpmc`. — Andreas Abel, May 17 '19 at 22:34
@AndreasAbel: That's true, so you might still want `lfence` *after* `rdpmc`, even if you use that trick to avoid one before. But in this case we don't have to worry about later instructions: they can't *retire* before `rdpmc` executes, because retirement is in-order. The `1<<30` fixed counter counts `inst_retired.any`, IIRC. — Peter Cordes, May 17 '19 at 22:50
OK, but then there is, in general, no guarantee that the earlier instructions have retired by the time `rdpmc` is executed, so the trick with the false dependency on ECX doesn't seem to be correct. — Andreas Abel, May 17 '19 at 22:59
@AndreasAbel: That's true. It maybe makes sense for a counter like `uops_executed.thread` or dispatched, rather than a retirement event counter. Or a cycle counter, not instructions/uops. Or for being approximately in the right place for memory events. But if it's at the end of a long dependency chain, if *all* older instructions have to have executed for a result to be ready (given limited ROB size), then it could still be useful. OTOH, if you already know that there are few to no in-flight uops, then you might as well just use `lfence` unless you're counting cycles and don't want overhead — Peter Cordes, May 17 '19 at 23:23
@PeterCordes What if no `IA32_PERFEVTSEL` is programmed to count a specific perf event one set as an `rdpmc` operand. — Some Name, Apr 30 '20 at 21:33
@SomeName: I really don't know. I assume the event counter doesn't increment and you'll get the same output every time from `rdpmc`. Possibly always `0`, IDK. — Peter Cordes, May 01 '20 at 00:19
@PeterCordes fixed counter returns crap for me (it's enabled and i tried 1 << 30 and I also tried 1 << 32 seeing as thats the position of the control bit in the control register -- i'm only interested in counter 0 (`INST_RETIRED.ANY`)). programmed counters work fine. `UOPS_RETIRED.ALL` works as expected, despite supposedly not being supported on kbl/skl — Lewis Kelsey, Apr 17 '21 at 18:14
@SomeName do you mean if the PMC is disabled, if the PMC doesnt exist, or if the PMC hasn't been programmed yet, or if the PMC has been disabled then reenabled, or if the PMC is programmed with a non event? I can find out — Lewis Kelsey, Apr 17 '21 at 18:22
@PeterCordes oops 1 << 32 wouldn't even fit in ecx anyway. To get `INST_RETIRED.ANY` on my KBL you have to program perfevtsel0 event 0 umask 1 and then rdpmc (1<<30). It returns 0 if you do rdpmc(0) — Lewis Kelsey, Apr 17 '21 at 19:20

score 4 · Answer 2 · edited May 18 '19 at 01:26

The first step is to ensure that the performance counters you want to use are enabled in the IA32_PERF_GLOBAL_CTRL MSR register, whose layout is shown in Figure 18-8 of the Intel Manual Volume 3 (January 2019). You can easily do this by loading the MSR kernel module (sudo modprobe msr) and executing the following command:

sudo rdmsr -a 0x38F

The value 0x38F is the address of the IA32_PERF_GLOBAL_CTRL MSR register and the -a option specifies that the rdmsr instruction should be executed on all logical cores. By default, this should print 7000000ff (when HT is disabled) or 70000000f (when HT is enabled) for all logical cores. For the INST_RETIRED.ANY fixed-function performance counter, the bit at index 32 is the one that enables it, so it should be 1. The value 7000000ff indicates that all three fixed-function counters and all eight programmable counters are enabled.
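The two default values can be decoded bit by bit. The following is a minimal sketch in bash, assuming the layout described above (bits 0-7 for the programmable counters, bits 32-34 for the fixed-function counters). `decode_global_ctrl` is a hypothetical helper that takes the hex string as rdmsr prints it, without a 0x prefix.

```bash
#!/usr/bin/env bash
# Decode an IA32_PERF_GLOBAL_CTRL value given as hex without the 0x prefix.
# Assumed layout: bits 0-7 enable programmable counters 0-7,
# bits 32-34 enable fixed-function counters 0-2.
decode_global_ctrl() {
    local value=$(( 16#$1 )) i
    for i in 0 1 2 3 4 5 6 7; do
        (( (value >> i) & 1 )) && echo "programmable counter $i enabled"
    done
    for i in 0 1 2; do
        (( (value >> (32 + i)) & 1 )) && echo "fixed counter $i enabled"
    done
    return 0
}

decode_global_ctrl 7000000ff    # HT disabled: 8 programmable + 3 fixed
decode_global_ctrl 70000000f    # HT enabled:  4 programmable + 3 fixed
```

On a real machine the argument could come from the output of sudo rdmsr -p 0 0x38F, which prints the value for a single logical core.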

The IA32_PERF_GLOBAL_CTRL register has one enable bit for each performance counter per logical core. Each programmable performance counter also has its own dedicated control register, and there is one control register for all of the fixed-function counters. In particular, the control register for the INST_RETIRED.ANY fixed-function performance counter is IA32_FIXED_CTR_CTRL, whose layout is shown in Figure 18-7 of the Intel Manual Volume 3. There are 12 defined bits in the register; the first 4 bits control the behavior of the first fixed-function counter, i.e., INST_RETIRED.ANY (the order is shown in Table 19-2). Before modifying the register, you should first check how it got initialized by the OS by executing:

sudo rdmsr -a 0x38D

It should print 0xb0, by default. This indicates that the second fixed-function counter (unhalted core cycles) is enabled and configured to count in both supervisor mode and user mode. To enable INST_RETIRED.ANY and configure it to count only user mode events while keeping the unhalted core cycles counter as is, execute the following command:

sudo wrmsr -a 0x38D 0xb2
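The value 0xb2 comes from a read-modify-write of the 4-bit field that belongs to the first fixed-function counter. Here is a minimal sketch in bash of that computation. The bit meanings within a field are stated as I understand them from the SDM figure and should be checked against it, and `set_fixed_ctr_mode` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Each fixed-function counter owns a 4-bit field in IA32_FIXED_CTR_CTRL.
# Within a field (assumed from the SDM layout):
#   bit 0 = count in ring 0 (OS), bit 1 = count in ring > 0 (user),
#   bit 2 = AnyThread, bit 3 = PMI on overflow.
set_fixed_ctr_mode() {
    # $1 = current register value (hex, no 0x), $2 = counter index, $3 = new 4-bit field
    local old=$(( 16#$1 )) pos=$(( 4 * $2 ))
    printf '%x\n' $(( (old & ~(0xF << pos)) | (($3 & 0xF) << pos) ))
}

set_fixed_ctr_mode b0 0 2    # counter 0: user mode only; counter 1's field is untouched
```

Computing the new value from whatever rdmsr actually printed, rather than hard-coding 0xb2, avoids clobbering the other counters if the OS initialized the register differently on your machine.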

Once this command is executed, the events are counted immediately. You can check this by reading the first fixed-function counter IA32_PERF_FIXED_CTR0 (see Table 19-2):

sudo rdmsr -a 0x309

You can execute that command multiple times and see how the counts on each core are changing. Unfortunately, this means that by the time your program is run, the current value in IA32_PERF_FIXED_CTR0 will be basically some random value. You can try to reset the counter by executing:

sudo wrmsr -a 0x309 0

But the fundamental problem remains; you cannot instantaneously reset the counter and run your program. As suggested in @Peter's answer, the right way to use any performance counter is to wrap the region of interest between rdpmc instructions and take the difference.

The MSR kernel module is very convenient because the only way to access MSR registers is in kernel mode. However, there is an alternative to wrapping the code between rdpmc instructions. You can write your own kernel module and place your code in the kernel module immediately after the instruction that enables the counter. You can even disable interrupts. Typically, this level of accuracy is not worth the effort.

You can use the -p option instead of -a to specify a particular logical core. However, you'll then have to make sure that the program runs on that same core, for example with taskset -c 3 ./a.out to run on core #3.

I've run through these instructions, and they work! — user14717, May 18 '19 at 21:08
