I was trying to understand the linux perf
, and found some really confusing behavior:
I wrote a simple multi-threading example with one thread pinned to each core; each thread runs computation locally and does not communicate with each other (see test.cc
below). I was thinking that this example should have really low, if not zero, context switches. However, using linux perf
to profile the example shows thousands of context-switches - much more than what I expected. I further profiled the linux command sleep 20
for a comparison, showing much fewer context switches.
This profile result does not make any sense to me. What is causing so many context switches?
> sudo perf stat -e sched:sched_switch ./test
Performance counter stats for './test':
6,725 sched:sched_switch
20.835 seconds time elapsed
> sudo perf stat -e sched:sched_switch sleep 20
Performance counter stats for 'sleep 20':
1 sched:sched_switch
20.001 seconds time elapsed
For reproducing the results, please run the following code:
perf stat -e context-switches sleep 20
perf stat -e context-switches ./test
To compile the source code, please type the following code:
g++ -std=c++11 -pthread -o test test.cc
// test.cc
#include <iostream>
#include <thread>
#include <vector>
int main(int argc, const char** argv) {
unsigned num_cpus = std::thread::hardware_concurrency();
std::cout << "Launching " << num_cpus << " threads\n";
std::vector<std::thread> threads(num_cpus);
for (unsigned i = 0; i < num_cpus; ++i) {
threads[i] = std::thread([i] {
int j = 0;
while (j++ < 100) {
int tmp = 0;
while (tmp++ < 110000000) { }
}
});
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(i, &cpuset);
int rc = pthread_setaffinity_np(threads[i].native_handle(),
sizeof(cpu_set_t), &cpuset);
if (rc != 0) {
std::cerr << "Error calling pthread_setaffinity_np: " << rc << "\n";
}
}
for (auto& t : threads) {
t.join();
}
return 0;
}