I'm working on a custom implementation on top of the perf_event_open syscall.
The implementation aims to support various PERF_TYPE_HARDWARE, PERF_TYPE_SOFTWARE and PERF_TYPE_HW_CACHE events for specific threads on any core.
In Intel® 64 and IA-32 Architectures Software Developer’s Manual vol 3B I see the following for my testing CPU (Kaby Lake):
To my understanding so far, one can (theoretically) monitor an unlimited number of PERF_TYPE_SOFTWARE events concurrently, but only a limited number of PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events (without multiplexing), since each such event occupies one of the limited number of counters in the CPU's PMU (as can be seen in the manual above).
So for a quad-core Kaby Lake CPU with HyperThreading enabled, I assume that up to 4 PERF_TYPE_HARDWARE/PERF_TYPE_HW_CACHE events can be monitored concurrently (or up to 8 if only 4 threads are used).
Experimenting with the above assumptions, I found that while I can successfully monitor up to 4 PERF_TYPE_HARDWARE events (for 8 threads), this is not the case for PERF_TYPE_HW_CACHE events, where only up to 2 events can be monitored concurrently!
I also tried using only 4 threads, but the upper limit of concurrently monitored PERF_TYPE_HARDWARE events remains 4. The same happens with HyperThreading disabled!
One could ask: why do you need to avoid multiplexing? First of all, the implementation needs to be as accurate as possible, which means avoiding the potential blind spots of multiplexing. Secondly, when the "upper limit" is exceeded, all event values are 0...
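For reference, this is a minimal sketch of how I understand multiplexed counts are meant to be scaled, assuming the group is opened with read_format = PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING (and without PERF_FORMAT_ID). The buffer layout follows the perf_event_open man page; the helper names parse_group_read and scale_count are hypothetical and are not part of my actual implementation. If time_running comes back as 0, the group was never scheduled on the PMU, which might be what the all-zero readings indicate.

```python
import struct


def parse_group_read(buf):
    """Decode a read() buffer from a group leader.

    Assumed layout (PERF_FORMAT_GROUP with both time fields, no PERF_FORMAT_ID):
    u64 nr, u64 time_enabled, u64 time_running, then nr u64 counter values.
    """
    nr, time_enabled, time_running = struct.unpack_from("=3Q", buf, 0)
    values = list(struct.unpack_from(f"={nr}Q", buf, 24))
    return time_enabled, time_running, values


def scale_count(raw, time_enabled, time_running):
    """Estimate the full-interval count from a multiplexed reading.

    Returns None when the event never ran, because no estimate is possible.
    """
    if time_running == 0:
        return None
    return raw * (time_enabled / time_running)


if __name__ == "__main__":
    # Fabricated buffer for illustration only: 4 events, counted for half
    # of the time they were enabled.
    demo = struct.pack("=7Q", 4, 200, 100, 10, 20, 30, 40)
    enabled, running, counts = parse_group_read(demo)
    print([scale_count(c, enabled, running) for c in counts])
```

The scaled value is only an estimate: it assumes the event rate during the unmeasured time matches the measured time, which is exactly the blind spot I would like to avoid.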
The PERF_TYPE_HW_CACHE events I'm targeting are:
CACHE_LLC_READ(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_READ.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_ACCESS.value << 16),
CACHE_LLC_WRITE(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_WRITE.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_ACCESS.value << 16),
CACHE_LLC_READ_MISS(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_READ.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_MISS.value << 16),
CACHE_LLC_WRITE_MISS(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_WRITE.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_MISS.value << 16),
All are implemented with the provided formula:
(perf_hw_cache_id) | (perf_hw_cache_op_id << 8) | (perf_hw_cache_op_result_id << 16)
and are manipulated as a group (the first is the group leader, etc.).
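To make the encoding concrete, here is a small sketch that evaluates the formula for the four LLC events. The numeric constants are the enum values from linux/perf_event.h; the helper hw_cache_config and the LLC_EVENTS dictionary are hypothetical names used only for this illustration.

```python
# Enum values as defined in <linux/perf_event.h>.
PERF_TYPE_HW_CACHE = 3                  # perf_event_attr.type for these events
PERF_COUNT_HW_CACHE_LL = 2              # enum perf_hw_cache_id
PERF_COUNT_HW_CACHE_OP_READ = 0         # enum perf_hw_cache_op_id
PERF_COUNT_HW_CACHE_OP_WRITE = 1
PERF_COUNT_HW_CACHE_RESULT_ACCESS = 0   # enum perf_hw_cache_op_result_id
PERF_COUNT_HW_CACHE_RESULT_MISS = 1


def hw_cache_config(cache_id, op_id, result_id):
    """Build perf_event_attr.config for a PERF_TYPE_HW_CACHE event."""
    return cache_id | (op_id << 8) | (result_id << 16)


# The four events from the question, in group order (leader first).
LLC_EVENTS = {
    "CACHE_LLC_READ": hw_cache_config(
        PERF_COUNT_HW_CACHE_LL,
        PERF_COUNT_HW_CACHE_OP_READ,
        PERF_COUNT_HW_CACHE_RESULT_ACCESS),
    "CACHE_LLC_WRITE": hw_cache_config(
        PERF_COUNT_HW_CACHE_LL,
        PERF_COUNT_HW_CACHE_OP_WRITE,
        PERF_COUNT_HW_CACHE_RESULT_ACCESS),
    "CACHE_LLC_READ_MISS": hw_cache_config(
        PERF_COUNT_HW_CACHE_LL,
        PERF_COUNT_HW_CACHE_OP_READ,
        PERF_COUNT_HW_CACHE_RESULT_MISS),
    "CACHE_LLC_WRITE_MISS": hw_cache_config(
        PERF_COUNT_HW_CACHE_LL,
        PERF_COUNT_HW_CACHE_OP_WRITE,
        PERF_COUNT_HW_CACHE_RESULT_MISS),
}

if __name__ == "__main__":
    for name, config in LLC_EVENTS.items():
        print(f"{name}: 0x{config:x}")
```

Since << binds more tightly than | in this expression, the unparenthesized enum definitions above should yield the same config values as the parenthesized formula.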
So, my questions are the following:
- Which counters of the PMU are used for PERF_TYPE_HARDWARE and which for PERF_TYPE_HW_CACHE events, and where can I find this information?
- What is the difference between the PERF_TYPE_HARDWARE pre-defined events (such as PERF_COUNT_HW_CACHE_MISSES) and the PERF_TYPE_HW_CACHE events?
- Any suggestions on how to monitor all listed PERF_TYPE_HW_CACHE events without multiplexing?
- Any suggestions on how to monitor up to 8 PERF_TYPE_HARDWARE and/or PERF_TYPE_HW_CACHE events without multiplexing?
Thanks in advance!