
Here is a piece of C++11 code for a multi-threaded vector sum.

```cpp
#include <functional>  // std::ref
#include <numeric>     // std::accumulate
#include <thread>
#include <vector>

template<typename ITER>
void sum_partial(ITER a, ITER b, double & result) {
  result = std::accumulate(a, b, 0.0);
}

template<typename ITER>
double sum(ITER begin, ITER end, unsigned int nb_threads) {
  size_t len = std::distance(begin, end);
  size_t size = len/nb_threads;

  // nb_threads-1 worker threads; the main thread handles the last chunk.
  std::vector<std::thread> thr(nb_threads-1);
  std::vector<double> r(nb_threads);
  size_t be = 0;
  for(size_t i = 0; i < nb_threads-1; i++) {
    size_t en = be + size;
    thr[i] = std::thread(sum_partial<ITER>, begin + be, begin + en, std::ref(r[i]));
    be = en;
  }
  sum_partial(begin + be, begin + len, r[nb_threads-1]);
  for(size_t i = 0; i < nb_threads-1; i++)
    thr[i].join();
  return std::accumulate(r.begin(), r.end(), 0.0);
}
```

The typical use will be `sum(x.begin(), x.end(), n)` with `x` a vector of doubles.

Here is a graph displaying the computation time as a function of the number of threads (average time for summing 10⁷ values, on an 8-core computer with nothing else running -- I tried on a 32-core computer, and the behaviour is very similar).

[graph: average computation time vs. number of threads]

Why is the scalability so poor? Can it be improved?

My (very limited) understanding is that for good scalability, threads should avoid writing to the same cache line. Here each thread writes to `r` only once, at the very end of its computation, so I wouldn't expect that to be the limiting factor. Is it a memory bandwidth problem?

Elvis
  • Needs units on the y axis. – UKMonkey Jan 09 '18 at 15:14
  • @UKMonkey it's in seconds but I don't think it matters... – Elvis Jan 09 '18 at 15:15
  • Well, if it takes [Milliseconds to create thread: 0.015625](https://stackoverflow.com/questions/18274217/how-long-does-thread-creation-and-termination-take-under-windows), it looks like thread creation could be your bottleneck - if your units are wrong. – UKMonkey Jan 09 '18 at 15:16
  • Good point, but code using a thread pool (through the TBB library) displays a similar behaviour. – Elvis Jan 09 '18 at 15:18
  • That would've been the better code to post as it has less variables ;) – UKMonkey Jan 09 '18 at 15:18
  • The test can (and should) be adjusted to prepare the threads ahead of time (and start the calculation with a signal), and even to allow vectorization, but he'll still see low scalability due to cache stalls. – Non-maskable Interrupt Jan 09 '18 at 15:18
  • @UKMonkey I chose to post the most readable code... Besides, if it takes 0.015 milliseconds to create one thread, then to create 6 threads it takes 0.09 milliseconds, which is negligible compared to a total run time of 4 milliseconds. – Elvis Jan 09 '18 at 15:21
  • @Non-maskableInterrupt I trust the standard library to allow vectorization in `std::accumulate`, am I wrong? (anyway that’s a totally different question) – Elvis Jan 09 '18 at 15:22
  • @Elvis Which is why units matter. – UKMonkey Jan 09 '18 at 15:30
  • `std::accumulate` is a template, so such vectorization is controlled by compiler optimization options. You may also want to enable prefetching (updated answer). – Non-maskable Interrupt Jan 09 '18 at 15:30
  • @Elvis Yes, you are wrong. `std::accumulate` is defined as a left fold, and vectorization would break this because floating-point addition [is not associative](https://stackoverflow.com/questions/10371857/is-floating-point-addition-and-multiplication-associative#10371890). (Depending on the fp strictness settings, the library/compiler might do it anyway, though.) I would definitely argue against using `std::accumulate` for best performance. – Arne Vogel Jan 09 '18 at 17:57
  • @Non-maskableInterrupt Nothing prevents a library from having optimized specializations of `accumulate` e.g. for a pointer or vector range of "normal" arithmetic type. In fact, the algorithm implementations in the standard are for exposition only. UPDATE: Err, well, I'm contradicting myself a bit here. Should say provided the result is the same, which in the fp case it's not. A vectorized integer accumulate would be fine, however. – Arne Vogel Jan 09 '18 at 18:00
  • @ArneVogel ok, thanks, this is very informative – Elvis Jan 09 '18 at 20:45

1 Answer


`std::accumulate` has low utilization of the CPU's arithmetic units; cache and memory throughput will most likely be the bottleneck, especially for 10^7 doubles, i.e. 80 MB of data, which is far more than your CPU's cache size. At a main-memory bandwidth on the order of 20 GB/s, just streaming 80 MB takes roughly 4 ms no matter how many threads are reading, which is consistent with the run times you report.

To overcome the cache and memory throughput bottleneck, you might want to enable prefetching with `-fprefetch-loop-arrays`, or even insert prefetch instructions manually.

Non-maskable Interrupt