12

I have a C++ program which could be parallelized. I'm using Visual Studio 2010, 32-bit compilation.

In short, the structure of the program is the following:

#define num_iterations 64 //some number

struct result
{ 
    //some stuff
};

result best_result=initial_bad_result;

for(i=0; i<many_times; i++)
{ 
    result results[num_iterations];


    for(j=0; j<num_iterations; j++)
    {
        some_computations(results+j);
    }

    // update best_result; 
}

Since each call to some_computations() is independent (some global variables are read, but no global variables are modified), I parallelized the inner for-loop.

My first attempt was with boost::thread,

 thread_group group;
 for(j=0; j<num_iterations; j++)
 {
     group.create_thread(boost::bind(&some_computations, this, results+j));
 } 
 group.join_all();

The results were good, but I decided to try more.

Next I tried OpenMP:

 #pragma omp parallel for
 for(j=0; j<num_iterations; j++)
 {
     some_computations(results+j);
 } 

The results were worse than those of the boost::thread version.

Then I tried the PPL library and used parallel_for():

 Concurrency::parallel_for(0, num_iterations, [=](int j) { 
     some_computations(results+j);
 });

The results were the worst.

I found this behaviour quite surprising. Since OpenMP and PPL are designed for parallelization, I would have expected better results than with boost::thread. Am I wrong?

Why is boost::thread giving me better results?

888
  • Could you please quantify "better", e.g. provide execution times versus the number of threads? With `boost::thread` you are creating 64 threads. OpenMP uses a team of worker threads whose number defaults to the number of virtual CPUs. PPL also uses a thread pool and has even higher overhead than OpenMP since it also implements work balancing. – Hristo 'away' Iliev Mar 05 '13 at 09:38
  • I used the same number (32 or 64) of threads for each try; maybe, as you pointed out, with OpenMP and PPL I could get better results by setting the number of threads equal to the number of cores (a sketch of what that might look like follows these comments). I'll try. – 888 Mar 05 '13 at 10:07
  • 1
    It's almost impossible to answer the question as it stand. What is `some_computations` doing? I there possible contention somewhere (which could hit the different libraries differently, e.g. if openmp has actually lower overhead, but you have a lot of writes to shared cachelines the resulting cache invalidation frenzy may actually make it slower)? How long does it take to run through the parallelized block for each variant – Grizzly Mar 07 '13 at 18:56
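
For reference, a minimal sketch of what capping the OpenMP worker count could look like, assuming the `result` struct and `some_computations` function from the question (the wrapper name `run_inner_loop` is invented here):

#include <omp.h>

struct result { /* some stuff */ };
void some_computations(result *res);   // the question's function, assumed to exist

void run_inner_loop(result *results, int num_iterations)
{
    // Roughly one worker per core; the OpenMP runtime reuses this thread
    // team across parallel regions instead of creating 64 new threads on
    // every outer iteration.
    #pragma omp parallel for num_threads(omp_get_num_procs())
    for (int j = 0; j < num_iterations; j++)
    {
        some_computations(results + j);
    }
}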

2 Answers

10

OpenMP and PPL do no such thing as being pessimistic. They just do as they are told; however, there are some things you should take into consideration when you do try to parallelize loops.

Without seeing how you implemented these things, it's hard to say what the real cause may be.

Also, if the operations in each iteration have some dependency on any other iteration of the same loop, this will create contention, which will slow things down. You haven't shown what your some_computations function actually does, so it's hard to tell whether there are data dependencies.

A loop that can be truly parallelized has to have each iteration run totally independently of all other iterations, with no shared memory being accessed by any of the iterations. So preferably, you'd write results to local variables and then copy them over at the end.
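
A rough illustration of that pattern (the helpers compute_one and better below are invented for the example, not taken from the question): every iteration works only on local data and its own slot of the output array, and the shared best result is picked in a sequential step afterwards.

#include <vector>

struct result { double score; /* some stuff */ };

// Stand-ins for the real work and comparison (invented for this sketch).
result compute_one(int j)                       { result r; r.score = (j % 7) * 1.0; return r; }
bool   better(const result &a, const result &b) { return a.score > b.score; }

result find_best(int num_iterations)
{
    std::vector<result> results(num_iterations);

    #pragma omp parallel for
    for (int j = 0; j < num_iterations; j++)
    {
        result local = compute_one(j);   // all work happens on local data
        results[j] = local;              // each iteration writes only its own slot
    }

    // Sequential reduction over the per-iteration results.
    result best = results[0];
    for (int j = 1; j < num_iterations; j++)
        if (better(results[j], best))
            best = results[j];
    return best;
}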

Not all loops can be parallelized; it is very dependent on the type of work being done.

For example, something that is good for parallelizing is work being done on each pixel of a screen buffer. Each pixel is totally independent of all other pixels, and therefore a thread can take one iteration of the loop and do the work without being held up waiting for shared memory or for data dependencies between iterations.
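
A minimal sketch of that pixel case (the buffer and the brightening step are made up purely for illustration):

#include <vector>

// Each pixel is read and written by exactly one iteration, so the
// iterations share no data and the loop parallelizes cleanly.
void brighten(std::vector<unsigned char> &pixels)
{
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(pixels.size()); i++)
    {
        int v = pixels[i] + 30;   // purely element-local work
        pixels[i] = static_cast<unsigned char>(v > 255 ? 255 : v);
    }
}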

Also, if you have a contiguous array, parts of it may sit in the same cache line, and if you are editing element 5 in thread A while thread B changes element 6, you may get cache contention, which will also slow things down, because both elements reside in the same cache line. This phenomenon is known as false sharing.
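
One common way to sidestep that, sketched here under the assumption of a 64-byte cache line (typical for x86, but not something stated in the question), is to pad each slot so that neighbouring elements land on different cache lines:

struct result { double score; /* some stuff */ };

// Assumes a 64-byte cache line. The padding pushes consecutive array
// elements onto different cache lines.
struct padded_result
{
    result value;
    char   pad[64];
};

void fill(padded_result *slots, int n)
{
    #pragma omp parallel for
    for (int j = 0; j < n; j++)
    {
        // Threads writing neighbouring slots no longer touch the same line.
        slots[j].value.score = j * 0.5;
    }
}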

There are many aspects to think about when doing loop parallelization.

Tony The Lion
  • Your function `some_computations` takes an offset into an array, and the array is shared among several threads. I don't know that either PPL or OpenMP can make any guarantees that you're not writing to that array, or that anything else is writing to that array. Therefore my answer doesn't change. – Tony The Lion Mar 04 '13 at 17:30
  • 1
    Your first paragraph is not true. Neither OpenMP nor PPL cares what you do to shared variables and there is nothing pessimistic or optimistic in the way they work. Both are imperative programming concepts, which means that the compiler makes the code parallel if told so rather than treating the expressions just as hints. Proper treatment of shared variables is left solely to the programmer. – Hristo 'away' Iliev Mar 05 '13 at 09:43
3

In short, OpenMP is mainly built around shared memory, with the additional cost of task management and memory management. PPL is designed to handle generic patterns over common data structures and algorithms, and it brings an additional complexity cost. Both of them have extra CPU cost, but your plain boost threads do not (boost threads are just a simple API wrapper). That's why both of them are slower than your boost version. And, since the computations in your example are independent of one another, with no synchronization, OpenMP should be close to the boost version.

This holds in simple scenarios, but for complicated scenarios, with complicated data layouts and algorithms, the outcome is context dependent.

Peixu Zhu