
I have a question about the Microsoft PPL library, and about parallel programming in general. I am using FFTW to perform a large set (100,000) of 64 x 64 x 64 FFTs and inverse FFTs. In my current implementation, I use a parallel for loop and allocate the storage arrays within the loop. I have noticed that my CPU usage only tops out at about 60-70% in these cases. (Note that this is still better utilization than the built-in threaded FFTs provided by FFTW, which I have tested.) Since I am using fftw_malloc, is it possible that excessive locking is occurring and preventing full usage?
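
For reference, my current implementation looks roughly like this (a simplified sketch; the plan handling is elided and the names are illustrative):

    #include <ppl.h>
    #include <fftw3.h>

    void run_ffts(int count, int n) // count = 100,000 and n = 64 in my case
    {
        Concurrency::parallel_for(0, count, [=](int i)
        {
            // allocation inside the loop body: every iteration calls fftw_malloc
            fftw_complex* in  = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * n * n * n);
            fftw_complex* out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * n * n * n);

            // ... fill 'in', execute the FFT / inverse FFT, consume 'out' ...

            fftw_free(in);
            fftw_free(out);
        });
    }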

In light of this, is it advisable to preallocate the storage arrays for each thread before the main processing loop, so that no locks are required within the loop itself? And if so, how is this possible with the Microsoft PPL library? I have used OpenMP before, and there it is simple enough to get a thread ID using the supplied functions. I have not, however, seen a similar function in the PPL documentation.

– Kyle Lynch

2 Answers


I am just answering this because nobody has posted anything yet.

Mutexes can wreak havoc on performance if heavy locking is required. In addition, if a lot of memory (re)allocation is needed, that can also decrease performance and limit you to your memory bandwidth. Like you said, a preallocation which the threads later operate on can be useful. However, this requires that you have a fixed thread count and that you balance your workload across all threads.
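
To illustrate, with OpenMP (which you mention) such a preallocation is straightforward because every thread can index a buffer pool with its ID; a minimal sketch, with the buffer type and loop body only illustrative:

    #include <omp.h>
    #include <vector>
    #include <fftw3.h>

    void run_with_pool(int count, int n)
    {
        int nthreads = omp_get_max_threads();

        // one buffer per thread, allocated once before the loop
        std::vector<fftw_complex*> pool(nthreads);
        for (int t = 0; t < nthreads; ++t)
            pool[t] = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * n * n * n);

        #pragma omp parallel for
        for (int i = 0; i < count; ++i)
        {
            fftw_complex* buf = pool[omp_get_thread_num()]; // no allocation here
            // ... execute FFT number i using buf ...
        }

        for (int t = 0; t < nthreads; ++t)
            fftw_free(pool[t]);
    }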

Concerning the PPL thread_id functions, I can only speak for Intel TBB, which should, however, be pretty similar to PPL. TBB, and I suppose PPL as well, does not speak of threads directly; instead it talks about tasks. The aim of TBB was to abstract these underlying details away from the user, so it does not provide a thread_id function.
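
What you can do instead of asking for a thread ID is to make the work items coarser, so that one allocation serves many FFTs. A sketch of that idea using the stepped overload of PPL's parallel_for (the chunk size and loop body are illustrative):

    #include <ppl.h>
    #include <fftw3.h>
    #include <algorithm>

    void run_chunked(int count, int n)
    {
        const int chunk = 1000; // grain size: one allocation per chunk, not per FFT

        Concurrency::parallel_for(0, count, chunk, [=](int begin)
        {
            int end = std::min(begin + chunk, count);

            fftw_complex* buf = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * n * n * n);

            for (int i = begin; i < end; ++i)
            {
                // ... execute FFT number i using buf ...
            }

            fftw_free(buf);
        });
    }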

– Stephan Dollberg
  • Thanks for the quick reply. I agree with you; the bottleneck appears to be related to the large number of memory allocations, as the FFT size is decreased, the relative parallel speedup increases (e.g., 16x16x16 gives almost linear speedup with cores and 100% utilization). Knowing that PPL is a task-based framework, do you know of any way to do a memory allocation per task (and not per loop iteration, as is done now)? Basically, having a dynamically allocated set of private data for each thread/task. – Kyle Lynch Apr 04 '12 at 10:17
  • I apologize in advance for my naivety with regard to these frameworks; my background is in engineering, not CS. – Kyle Lynch Apr 04 '12 at 10:21
  • @KyleLynch Well, assuming your workload is balanced, I can think of the following: keep the grain size as big as possible, and then allocate the memory in your worker before the loop. EDIT: I just checked out ppl::parallel_for and the interface is indeed a little different from TBB's. The advantage of TBB here is that you can work better with a range. For the same behaviour in PPL you probably have to write a kind of wrapper function. I hope the code explains what I mean. http://pastebin.com/Tg4FL6KB – Stephan Dollberg Apr 04 '12 at 12:01
  • Thanks very much! Your suggestions are helpful. I was also able to find a way to control the granularity a bit better using PPL from this link: http://msdn.microsoft.com/en-us/library/gg663527.aspx It now works much better. – Kyle Lynch Apr 04 '12 at 15:47

Using PPL, I have had good performance with an application that does a lot of allocations by using a Concurrency::combinable to hold a structure containing the memory allocated per thread.

In fact, you don't have to pre-allocate: you can check the value of your combinable variable with ->local() and allocate it if it is null. The next time this thread runs a task, the memory will already be allocated.

Of course, you have to free the memory when all tasks are done, which can be done with something like:

    // assuming a Concurrency::combinable<MyPtr*> named myCombinable
    myCombinable.combine_each([](MyPtr* p) { delete p; });
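
Putting the whole pattern together, a minimal sketch (the buffer type is illustrative, using fftw_malloc/fftw_free to match the question):

    #include <ppl.h>
    #include <fftw3.h>

    void run_with_combinable(int count, int n)
    {
        // one slot per thread; pointer values default-initialize to nullptr
        Concurrency::combinable<fftw_complex*> storage;

        Concurrency::parallel_for(0, count, [&](int i)
        {
            fftw_complex*& buf = storage.local();  // this thread's slot
            if (buf == nullptr)                    // first task on this thread: allocate
                buf = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * n * n * n);

            // ... execute FFT number i using buf ...
        });

        // free each thread's buffer once all tasks are done
        storage.combine_each([](fftw_complex* p) { fftw_free(p); });
    }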
– Wahyu Kristianto