
Here is a piece of C++11 code for a multi-threaded vector sum.

```cpp
#include <functional>  // std::ref
#include <numeric>     // std::accumulate
#include <thread>
#include <vector>

template<typename ITER>
void sum_partial(ITER a, ITER b, double & result) {
  result = std::accumulate(a, b, 0.0);
}

template<typename ITER>
double sum(ITER begin, ITER end, unsigned int nb_threads) {
  size_t len = std::distance(begin, end);
  size_t size = len/nb_threads;

  // nb_threads-1 worker threads; the main thread handles the last chunk.
  std::vector<std::thread> thr(nb_threads-1);
  std::vector<double> r(nb_threads);
  size_t be = 0;
  for(size_t i = 0; i < nb_threads-1; i++) {
    size_t en = be + size;
    thr[i] = std::thread(sum_partial<ITER>, begin + be, begin + en, std::ref(r[i]));
    be = en;
  }
  sum_partial(begin + be, begin + len, r[nb_threads-1]);
  for(size_t i = 0; i < nb_threads-1; i++)
    thr[i].join();
  return std::accumulate(r.begin(), r.end(), 0.0);
}
```

The typical use will be `sum(x.begin(), x.end(), n)` with `x` a vector of doubles.

Here is a graph displaying the computation time as a function of the number of threads (average time for summing 10⁷ values, on an 8-core computer with nothing else running -- I tried on a 32-core computer, and the behaviour is very similar).

[graph: average computation time vs. number of threads]

Why is the scalability so poor? Can it be improved?

My (very limited) understanding is that for good scalability, threads should avoid writing to the same cache line. Here each thread writes to `r` only once, at the very end of its computation, so I wouldn't expect that to be the limiting factor. Is it a memory bandwidth problem?

Elvis
  • Needs units on the y axis. – UKMonkey Jan 09 '18 at 15:14
  • @UKMonkey it's in seconds but I don't think it matters... – Elvis Jan 09 '18 at 15:15
  • Well, if it takes [Milliseconds to create thread: 0.015625](https://stackoverflow.com/questions/18274217/how-long-does-thread-creation-and-termination-take-under-windows), it looks like thread creation could be your bottleneck - if your units are wrong. – UKMonkey Jan 09 '18 at 15:16
  • Good point, but code using a thread pool (through the TBB library) displays a similar behaviour. – Elvis Jan 09 '18 at 15:18
  • That would've been the better code to post as it has less variables ;) – UKMonkey Jan 09 '18 at 15:18
  • The test can (and should) be adjusted to prepare the threads ahead of time (and start the calculation with a signal), and even to allow vectorization, but he'll still see low scalability due to cache stalls. – Non-maskable Interrupt Jan 09 '18 at 15:18
  • @UKMonkey I chose to post the most readable code... Besides, if it takes 0.015 milliseconds to create one thread, then to create 6 threads it takes 0.09 milliseconds, which is negligible compared to a total run time of 4 milliseconds. – Elvis Jan 09 '18 at 15:21
  • @Non-maskableInterrupt I trust the standard library to allow vectorization in `std::accumulate`, am I wrong? (anyway that’s a totally different question) – Elvis Jan 09 '18 at 15:22
  • @Elvis Which is why units matter. – UKMonkey Jan 09 '18 at 15:30
  • `std::accumulate` is a template, so such vectorization is controlled by compiler optimization options. You may also want to enable prefetching (updated answer). – Non-maskable Interrupt Jan 09 '18 at 15:30
  • @Elvis Yes, you are wrong. `std::accumulate` is defined as a left fold, and vectorization would break this because floating-point addition [is not associative](https://stackoverflow.com/questions/10371857/is-floating-point-addition-and-multiplication-associative#10371890). (Depending on the fp strictness settings, the library/compiler might do it anyway, though.) I would definitely argue against using `std::accumulate` for best performance. – Arne Vogel Jan 09 '18 at 17:57
  • @Non-maskableInterrupt Nothing prevents a library from having optimized specializations of `accumulate` e.g. for a pointer or vector range of "normal" arithmetic type. In fact, the algorithm implementations in the standard are for exposition only. UPDATE: Err, well, I'm contradicting myself a bit here. Should say provided the result is the same, which in the fp case it's not. A vectorized integer accumulate would be fine, however. – Arne Vogel Jan 09 '18 at 18:00
  • @ArneVogel ok, thanks, this is very informative – Elvis Jan 09 '18 at 20:45

1 Answer


`std::accumulate` has low utilization of the CPU's arithmetic units; cache and memory throughput will most likely be the bottleneck, especially for 10^7 doubles, i.e. 80 MB of data, which is far more than your CPU's cache size. At a main-memory bandwidth on the order of 20 GB/s, just streaming 80 MB takes roughly 4 ms no matter how many threads are reading, which is consistent with the run times you report.

To overcome the cache and memory throughput bottleneck, you might want to enable prefetching with `-fprefetch-loop-arrays`, or even insert prefetch instructions manually.

Non-maskable Interrupt