Questions tagged [intel-pmu]

Questions related to the use of the Intel Performance Monitoring Unit (PMU), which provides hardware counters that measure the behavior of currently executing code.

The Intel Performance Monitoring Unit (PMU) provides hardware performance counters which track performance-related metrics for the currently executing code.

They are useful for profiling code, and are supported by Intel's VTune, Linux's perf command, and the Windows Performance Toolkit.

The counters, and the details of how to program them, vary by CPU architecture. The details are available in Chapters 18 and 19 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3.
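
As a rough illustration of what "programming" a counter involves, the sketch below packs an event selection into the architectural IA32_PERFEVTSELx bit layout (event select in bits 0-7, unit mask in bits 8-15, USR in bit 16, OS in bit 17, EN in bit 22, counter mask in bits 24-31). The function names `encode_perfevtsel` and `perf_raw_event` are hypothetical helpers chosen for this example, not part of any tool, and the example event (0x2E with unit mask 0x41, the architectural "LLC misses" event) is used only to show the arithmetic. Actually writing the value to the MSR requires kernel privileges and is not shown.

```python
def encode_perfevtsel(event, umask, user=True, kernel=True, enable=True, cmask=0):
    """Pack fields into the architectural IA32_PERFEVTSELx layout.

    Hypothetical helper for illustration; it only computes the value
    and never touches any hardware register.
    """
    for name, field in (("event", event), ("umask", umask), ("cmask", cmask)):
        if not 0 <= field <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    value = event | (umask << 8) | (cmask << 24)
    if user:
        value |= 1 << 16   # USR: count while running at CPL > 0
    if kernel:
        value |= 1 << 17   # OS: count while running at CPL 0
    if enable:
        value |= 1 << 22   # EN: enable the counter
    return value


def perf_raw_event(event, umask):
    """Format the same event/umask pair in Linux perf's raw 'rNNNN' hex syntax."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must fit in 8 bits")
    return f"r{(umask << 8) | event:04x}"


if __name__ == "__main__":
    print(hex(encode_perfevtsel(0x2E, 0x41)))
    print(perf_raw_event(0x2E, 0x41))
```

Model-specific events use the same layout with different event and unit mask codes, which is why the per-architecture event tables in the manual matter.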

65 questions
19
votes
2 answers

Haswell memory access

I was experimenting with the AVX and AVX2 instruction sets to see the performance of streaming on consecutive arrays. So I have the example below, where I do a basic memory read and store. #include #include #include #include…
edorado
  • 275
  • 2
  • 10
18
votes
1 answer

What restriction is perf_event_paranoid == 1 actually putting on x86 perf?

Newer Linux kernels have a sysctl tunable, /proc/sys/kernel/perf_event_paranoid, which allows the user to adjust the available functionality of perf_events for non-root users, with higher numbers being more secure (offering correspondingly less…
BeeOnRope
  • 51,419
  • 13
  • 149
  • 309
17
votes
0 answers

On Skylake (SKL) why are there L2 writebacks in a read-only workload that exceeds the L3 size?

Consider the following simple code: #include #include #include #include #include int cpu_ms() { return (int)(clock() * 1000 / CLOCKS_PER_SEC); } int main(int argc, char** argv) { if (argc <…
BeeOnRope
  • 51,419
  • 13
  • 149
  • 309
14
votes
5 answers

Can the Intel performance monitor counters be used to measure memory bandwidth?

Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means to DRAM (i.e., not hitting in any cache level).
BeeOnRope
  • 51,419
  • 13
  • 149
  • 309
11
votes
2 answers

Reliability of Xcode Instrument's disassembly time profiling

I've profiled my code using Instruments' Time Profiler, and zooming in to the disassembly, here's a snippet of its results: I wouldn't expect a mov instruction to take 23.3% of the time while a div instruction takes virtually nothing. This causes…
yairchu
  • 21,122
  • 7
  • 65
  • 104
10
votes
1 answer

Can the LSD issue uOPs from the next iteration of the detected loop?

I was investigating the capabilities of the branch unit on port 0 of my Haswell, starting with a very simple loop: BITS 64 GLOBAL _start SECTION .text _start: mov ecx, 10000000 .loop: dec ecx ;| jz .end ;| 1…
Margaret Bloom
  • 33,863
  • 5
  • 53
  • 91
9
votes
2 answers

Why does the number of uops per iteration increase with the stride of streaming loads?

Consider the following loop: .loop: add rsi, OFFSET mov eax, dword [rsi] dec ebp jg .loop where OFFSET is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the…
Hadi Brais
  • 18,864
  • 3
  • 43
  • 78
7
votes
2 answers

rdpmc: surprising behavior

I'm trying to understand the rdpmc instruction. As such I have the following asm code: segment .text global _start _start: xor eax, eax mov ebx, 10 .loop: dec ebx jnz .loop mov ecx, 1<<30 ; calling rdpmc with ecx = (1<<30)…
user14717
  • 3,854
  • 2
  • 26
  • 61
7
votes
0 answers

Why does Linux perf use event l1d.replacement for "L1 dcache misses" on x86?

On Intel x86, Linux uses the event l1d.replacement to implement its L1-dcache-load-misses event. This event is defined as follows: Counts L1D data line replacements including opportunistic replacements, and replacements that require…
BeeOnRope
  • 51,419
  • 13
  • 149
  • 309
7
votes
1 answer

Hardware cache events and perf

When I run perf list I see a bunch of Hardware Cache Events, as follows: $ perf list | grep 'cache event' L1-dcache-load-misses [Hardware cache event] L1-dcache-loads [Hardware…
BeeOnRope
  • 51,419
  • 13
  • 149
  • 309
6
votes
1 answer

Why are the user-mode L1 store miss events only counted when there is a store initialization loop?

Summary Consider the following loop: loop: movl $0x1,(%rax) add $0x40,%rax cmp %rdx,%rax jne loop where rax is initialized to the address of a buffer that is larger than the L3 cache size. Every iteration performs a store operation to…
Hadi Brais
  • 18,864
  • 3
  • 43
  • 78
6
votes
2 answers

Can we measure successful store-forwarding with Intel's performance counters?

Is it possible to measure the number of successful store-forwarding operations using the performance counters on recent Intel x86 chips? I see events for ld_blocks.store_forward which measure failed store-forwarding, but it's not clear to me if the…
BeeOnRope
  • 51,419
  • 13
  • 149
  • 309
5
votes
1 answer

PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE concurrent monitoring

I'm working on a custom implementation on top of the perf_event_open syscall. The implementation aims to support a variety of PERF_TYPE_HARDWARE, PERF_TYPE_SOFTWARE and PERF_TYPE_HW_CACHE events for specific threads on any core. In Intel® 64 and IA-32…
Orion Papadakis
  • 338
  • 1
  • 14
5
votes
2 answers

Is it possible for the RESOURCE_STALLS.RS event to occur even when the RS is not completely full?

The description of the RESOURCE_STALLS.RS hardware performance event for Intel Broadwell is the following: This event counts stall cycles caused by absence of eligible entries in the reservation station (RS). This may result from RS overflow, or …
Hadi Brais
  • 18,864
  • 3
  • 43
  • 78
5
votes
4 answers

Hardware Performance counter on Intel Core Duo

I have read that there are AMD processors out there that allow you to measure the number of cache hits and misses. I am wondering whether such a feature is also available on Intel Core Duo machines, or whether they do not support this yet.
Alex12
  • 81
  • 3