Here is a piece of C++11 code for a multi-threaded vector sum.
#include <thread>
#include <vector>   // std::vector
#include <numeric>  // std::accumulate

// Sums the range [a, b) and stores the result through `result`.
template<typename ITER>
void sum_partial(ITER a, ITER b, double &result) {
    result = std::accumulate(a, b, 0.0);
}

// Splits [begin, end) into nb_threads chunks; the first nb_threads-1
// chunks are summed on worker threads, the last on the calling thread.
template<typename ITER>
double sum(ITER begin, ITER end, unsigned int nb_threads) {
    size_t len = std::distance(begin, end);
    size_t size = len / nb_threads;
    std::vector<std::thread> thr(nb_threads - 1);
    std::vector<double> r(nb_threads);
    size_t be = 0;
    for (size_t i = 0; i < nb_threads - 1; i++) {
        size_t en = be + size;
        thr[i] = std::thread(sum_partial<ITER>, begin + be, begin + en,
                             std::ref(r[i]));
        be = en;
    }
    sum_partial(begin + be, begin + len, r[nb_threads - 1]);
    for (size_t i = 0; i < nb_threads - 1; i++)
        thr[i].join();
    return std::accumulate(r.begin(), r.end(), 0.0);
}
The typical use is sum(x.begin(), x.end(), n), with x a std::vector<double>.
Here is a graph displaying the computation time as a function of the number of threads (average time for summing 10⁷ values, on an 8-core computer with nothing else running; I also tried on a 32-core computer, and the behaviour is very similar).
Why is the scalability so poor? Can it be improved?
My (very limited) understanding is that for good scalability, threads should avoid writing to the same cache line. Here, all threads write to r, but since each thread writes to its slot only once, at the very end of its computation, I wouldn't expect that to be the limiting factor. Is it a memory bandwidth problem?
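To test the false-sharing hypothesis myself, I could pad each partial result to its own cache line so that no two threads ever touch the same line. A sketch of that variant (the 64-byte line size is an assumption, and pre-C++17 a std::vector's allocator may not honour the extended alignment, though the 64-byte spacing between slots is what matters here):

```cpp
#include <thread>
#include <vector>
#include <numeric>
#include <cstddef>

// One slot per thread, padded to a (presumed) 64-byte cache line, so
// adjacent slots never share a line.
struct alignas(64) PaddedDouble {
    double value = 0.0;
};

template<typename ITER>
double sum_padded(ITER begin, ITER end, unsigned int nb_threads) {
    size_t len = std::distance(begin, end);
    size_t chunk = len / nb_threads;
    std::vector<PaddedDouble> r(nb_threads);
    std::vector<std::thread> thr;
    size_t be = 0;
    for (size_t i = 0; i + 1 < nb_threads; i++) {
        size_t en = be + chunk;
        thr.emplace_back([begin, be, en, i, &r] {
            // Accumulate locally; r[i] is written exactly once.
            r[i].value = std::accumulate(begin + be, begin + en, 0.0);
        });
        be = en;
    }
    r[nb_threads - 1].value = std::accumulate(begin + be, begin + len, 0.0);
    for (auto &t : thr)
        t.join();
    double total = 0.0;
    for (auto &p : r)
        total += p.value;
    return total;
}
```

If this version scales no better than the original, false sharing on r can presumably be ruled out as the cause.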