
I am trying to convert a C project of mine from sequential to parallel code. Although most of the code has been redesigned from scratch for this purpose, random number generation is still at its core, so poor performance of the random number generator (RNG) drags down the overall performance of the program.

I wrote a short test program (see below) that shows the problem I am facing without much verbosity.

The problem is the following: every time the number of threads nt increases, the performance gets significantly worse. On this workstation (Linux kernel 2.6.33.4; gcc 4.4.4; Intel quad-core CPU) the parallel for-loop takes roughly 10x longer to finish with nt=4 than with nt=1, regardless of the number of iterations n.

This situation seems to be described here, but the focus there is mainly on Fortran, a language I know very little about, so I would very much appreciate some help.

I tried to follow their idea of giving each thread its own RNG (each with a different seed), but the performance is still very bad. The per-thread seeding also bugs me, because I cannot see how one can guarantee the quality of the generated numbers in the end (absence of correlations between streams, etc.).

I have already thought of dropping GSL altogether and implementing an RNG algorithm (such as the Mersenne Twister) myself, but I suspect I would just bump into the same issue later on.

Thank you very much in advance for your answers and advice. Please ask about anything important I may have forgotten to mention.

EDIT: Implemented the corrections suggested by lucas1024 (pragma for-loop declaration) and JonathanDursi (seeding; making "a" a private variable). Performance is still very sluggish in multithreaded mode.

EDIT 2: Implemented solution suggested by Jonathan Dursi (see comments).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <gsl/gsl_rng.h>
    #include <omp.h>

    double d_t (struct timespec t1, struct timespec t2){

        return (t2.tv_sec-t1.tv_sec)+(double)(t2.tv_nsec-t1.tv_nsec)/1000000000.0;
    }

    int main (int argc, char *argv[]){

        double a, b;

        int i,j,k;

        int n=atoi(argv[1]), seed=atoi(argv[2]), nt=atoi(argv[3]);

        printf("\nn\t= %d", n);
        printf("\nseed\t= %d", seed);
        printf("\nnt\t= %d", nt);

        struct timespec t1, t2, t3, t4;

        // note: CLOCK_PROCESS_CPUTIME_ID sums the CPU time of all threads in
        // the process; a wall-clock source such as CLOCK_MONOTONIC would be
        // more appropriate for measuring parallel speedup
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

        // initialize the GSL random number generators (one per thread)
        const gsl_rng_type *rng_t;
        gsl_rng **rng;
        gsl_rng_env_setup();
        rng_t = gsl_rng_default;

        rng = (gsl_rng **) malloc(nt * sizeof(gsl_rng *));

        // each thread allocates and seeds its own generator, so the RNG
        // state is first touched by the thread that will use it
        #pragma omp parallel for num_threads(nt)
        for(i=0;i<nt;i++){
            rng[i] = gsl_rng_alloc (rng_t);
            gsl_rng_set(rng[i],seed*i);
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t2);

        // sequential baseline: draw n numbers from a single generator
        for (i=0;i<n;i++){
            a = gsl_rng_uniform(rng[0]);
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t3);

        // parallel version: each thread draws from its own generator rng[j]
        omp_set_num_threads(nt);
        #pragma omp parallel private(j,a)
        {
            j = omp_get_thread_num();
            #pragma omp for
            for(i=0;i<n;i++){
                a = gsl_rng_uniform(rng[j]);
            }
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t4);

        printf("\n\ninitializing:\t\tt1 = %f seconds", d_t(t1,t2));
        printf("\nsequencial for loop:\tt2 = %f seconds", d_t(t2,t3));
        printf("\nparalel for loop:\tt3 = %f seconds (%f * t2)", d_t(t3,t4), (double)d_t(t3,t4)/(double)d_t(t2,t3));
        printf("\nnumber of threads:\tnt = %d\n", nt);

        // free the per-thread random number generators
        for (i=0;i<nt;i++)
            gsl_rng_free(rng[i]);
        free(rng);

        return 0;

    }
dd_rlwll
  • I don't know how your random number generator works, but if it relies on the system entropy pool, that may be running dry. If so check out http://www.entropykey.co.uk/ – blueshift Mar 29 '12 at 00:01
  • I am sorry blueshift but I do not think the problem I described is even remotely related to what you pointed out. – dd_rlwll Mar 29 '12 at 00:33
  • No problem! Checking /proc/sys/kernel/random/entropy_avail may confirm one way or the other. – blueshift Mar 29 '12 at 00:39
  • One issue is that in the parallel loop, all threads are writing to a single shared variable a; that should be made private. Also, you're setting all your seeds to `seed` (not a performance issue, admittedly); it should be seed*i, not seed*omp_get_thread_num() = seed*1 outside of a parallel region. You can also use more compute-intensive RNGs like gsl_rng_mt19937. But there's still a factor of 10-20 performance drop here which doesn't go down with N, and I've got to admit I don't see it. I thought it might be false-sharing with the rng array but spacing things out more it's still there. – Jonathan Dursi Mar 29 '12 at 02:43
  • @JonathanDursi, it is true indeed that the seeding was wrong. Actually, and regarding one of my original questions, how can I guarantee the quality of the generated numbers if I seed the different RNG this way (i*seed)? Or any other way (e.g.: i*seed*3141592)? I would say that an overlap, which seems to be (very) possible in this situation, would hugely affect correlations. I agree that it might be useful to use a more compute-intensive RNG and I have just tested it with gsl_rng_mt19937. Unfortunately, the drop in performance is still huge. – dd_rlwll Mar 29 '12 at 06:17
  • @JonathanDursi, I have also declared a to be a private variable, which yielded some performance improvement. Actually, in my original code I will have different threads updating elements of the same (big) array. Will this constitute a problem for parallelization? Thanks a lot for both your replies up to now, btw. – dd_rlwll Mar 29 '12 at 06:24
  • @dd_rlwll: if you're updating a shared array (eg, a[i] = gsl_rng_uniform()) that should be fine. In terms of seeding, a nice review paper by Katzgrabber ( http://arxiv.org/abs/1005.4117 ) suggests hashing together both the time (or you could use your initial seed) and the thread number like so - `s = time(); j = omp_get_thread_num(); seed = abs(((s*181)*((j-83)*359))%104729);` -- you could play around with the particular hash function. (A sketch of this seeding scheme appears after these comments.) – Jonathan Dursi Mar 29 '12 at 11:23
  • So I'm now seeing about a 2.5x drop in performance with the threading for large N with 4 threads; that's ~10x off from where it should be... – Jonathan Dursi Mar 29 '12 at 11:53
  • Ah -- ok, this is partly a memory locality thing. Above your initialization loop, where you're calling `gsl_rng_set`, put `#pragma omp parallel for private(i) num_threads(nt)`. – Jonathan Dursi Mar 29 '12 at 12:04
  • @JonathanDursi, same here. I think that we can rule out any kind of trouble coming from the compiler or running platform. Thanks a lot for the earlier arxiv reference, btw. – dd_rlwll Mar 29 '12 at 12:05
  • I'm now seeing a 1.5x speedup with 4 threads and large N, which is at least a speedup. – Jonathan Dursi Mar 29 '12 at 12:07
  • Yeah, so now after applying @lucas1024's suggestion to get rid of the nested parallel region, getting rid of the shared access to the scalar a, switching to gsl_rng_mt19937, but also doing the initialization in a parallel for loop, with N=1,000,000 I'm seeing close to ideal speedup up to 8 threads. – Jonathan Dursi Mar 29 '12 at 12:16
  • @JonathanDursi, what exactly is `#pragma omp parallel for private(i) num_threads(nt)` changing after all? Is it important that each thread not only uses its own RNG but also initializes it? Besides, although it is not really important for this matter, declaring i as private in the for-loop should be redundant, shouldn't it? – dd_rlwll Mar 29 '12 at 12:17
  • Yeah, the private doesn't really matter, I just like being explicit. What it's changing is memory locality. If thread 0 initializes all the RNGs, the memory then "belongs" to thread 0, which is a performance problem on multi-socket machines (or machines with multiple caches). By having each thread initialize its own variables, they are placed in memory associated with the appropriate processor. You'll want to look up "memory affinity". – Jonathan Dursi Mar 29 '12 at 12:22
  • @JonathanDursi, I most definitely will. Thank you so much for your help. Now, a final silly question: how can I mark this question as "solved"? I can't seem to find the link/button... – dd_rlwll Mar 29 '12 at 12:25
  • As a long shot --- did you by any chance try removing the timing calls? – ev-br Apr 05 '12 at 05:51
  • @dd_rlwll: There should be a checkmark to click under the score of the answer to accept it. So you can either accept Lubo Antonov's answer if you feel it was sufficiently helpful, or write a new answer and accept that. – Grizzly Jan 08 '13 at 12:45
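
For reference, here is a minimal sketch of the per-thread seeding scheme suggested in the comments above (hashing the time, or an initial seed, with the thread number). The function name `seed_rngs`, the use of `gsl_rng_mt19937`, and the particular hash constants are illustrative choices, not a definitive recipe:

    #include <stdlib.h>
    #include <time.h>
    #include <gsl/gsl_rng.h>
    #include <omp.h>

    // Allocate and seed one generator per thread. Each thread seeds its
    // own RNG by hashing the current time with its thread number, as
    // suggested in the comments; the prime constants can be tuned.
    void seed_rngs(gsl_rng **rng, int nt)
    {
        long s = (long) time(NULL);
        #pragma omp parallel num_threads(nt)
        {
            int j = omp_get_thread_num();
            unsigned long seed = (unsigned long) labs(((s*181)*((j-83)*359)) % 104729);
            rng[j] = gsl_rng_alloc(gsl_rng_mt19937);
            gsl_rng_set(rng[j], seed);
        }
    }

Because each thread allocates and seeds its own generator, the RNG state is also first touched by the thread that will later use it, which matters for memory affinity as discussed above.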

1 Answer


The problem is in the second #pragma omp line. The first #pragma omp spawns 4 threads. After that you are supposed to simply say #pragma omp for - not #pragma omp parallel for.

With the current code, depending on your OpenMP nesting settings, you are creating 4 x 4 threads that are doing the same work and accessing the same data.
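
As a minimal sketch of the corrected structure (with placeholder loop bounds and thread count), the parallel region is opened once and the loop inside it is a plain work-sharing `for`:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int i, n = 16, nt = 4;

        // one team of nt threads; a second "#pragma omp parallel for" in
        // here would instead open a nested region with up to nt*nt threads,
        // each running the whole loop on the same data
        #pragma omp parallel num_threads(nt)
        {
            #pragma omp for
            for (i = 0; i < n; i++)
                printf("iteration %d ran on thread %d\n", i, omp_get_thread_num());
        }
        return 0;
    }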

Lubo Antonov
  • Thanks a lot for your prompt reply, lucas1024! It's amazing how easily a mistake like that got through 100000x proofreading. After making the suggested change, performance improved significantly. However, it is still (way) slower than the single-threaded implementation (roughly 3x slower here for nt=4). Can someone please confirm this on a different platform? I will keep testing here with a different version of gcc, although I suspect something is still wrong with the code. – dd_rlwll Mar 29 '12 at 00:29
  • Hmm, @dd_rlwll, I am not familiar with this particular RNG, but I would say the problem has to be in its implementation. Your code just doesn't do enough to be any kind of bottleneck. Maybe it's sharing some kind of internal state that is causing synchronization overhead. There are very few multi-thread capable RNGs out there. – Lubo Antonov Mar 29 '12 at 00:47
  • "There are very few multi-thread capable RNGs out there." - Can you please name/recommend a few? Thanks a lot once again for the extremely useful feedback. – dd_rlwll Mar 29 '12 at 00:50
  • @dd_rlwll Take a look at this article: [Parallel Random Number Generation Using OpenMP, OpenCL and PGI Accelerator Directives](http://www.pgroup.com/lit/articles/insider/v2n2a4.htm). I don't have a specific one to recommend, though. – Lubo Antonov Mar 29 '12 at 00:57
  • @JonathanDursi, they should be threadsafe indeed. But I would say this is more about preventing blatant errors (e.g.: short cycle repetition of the generated numbers) than about performance. Following the corrections in your other comment, I am now using a different RNG for each thread, seeding each one with a different value, and using each one to update a private variable inside a loop. Performance is still very poor, though. Thanks a lot for any further input. – dd_rlwll Mar 29 '12 at 06:29
  • @JonathanDursi, thread-safe is not the same as multi-threaded. Thread-safe usually means that there is synchronization going on - which is the enemy of parallelism :) – Lubo Antonov Mar 29 '12 at 08:32
  • Yeah, he's not looking for a multithread RNG, though; he's looking for an RNG he can safely call from several threads to fill up an array. There's no synchronization needed here; the reason gsl_rng_* takes that rng[] argument is that's where all the state information for the RNG lives; there's no additional shared data, so there's nothing to synchronize around. (A sketch of this pattern appears after these comments.) – Jonathan Dursi Mar 29 '12 at 11:14
  • @lucas1024, that's exactly what I thought. – dd_rlwll Mar 29 '12 at 11:50
  • @JonathanDursi, I'd say I am looking for a multithread RNG indeed. Otherwise, being RNG at the core of the program, it will be impossible (or at least very hard) to take a significant advantage of parallel computation. Please do correct me if I'm wrong. – dd_rlwll Mar 29 '12 at 11:52
  • This is just a terminology thing; you're not looking for an RNG that uses multiple threads to handle a single RNG request; you're looking to be able to use multiple threads to independently each call a single RNG - thread-safe, not multithreaded. – Jonathan Dursi Mar 29 '12 at 11:52
  • @JonathanDursi, absolutely true. However, as you said, something is slowing down things (a lot) and the problem most probably lies in the way I am calling the RNG from inside the for-loop. I can't see why, though. – dd_rlwll Mar 29 '12 at 12:09
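
To make the pattern from the last few comments concrete, here is a sketch of filling a shared array with each thread calling its own generator. It assumes the per-thread generators have already been allocated and seeded (e.g. as in the question's initialization loop); the function name `fill_uniform` is illustrative:

    #include <gsl/gsl_rng.h>
    #include <omp.h>

    // Each thread draws from its own generator rng[j] and writes to
    // disjoint elements a[i]; since all of the RNG state lives inside
    // rng[j], no synchronization between threads is needed.
    void fill_uniform(double *a, int n, gsl_rng **rng)
    {
        int i;
        #pragma omp parallel
        {
            int j = omp_get_thread_num();
            #pragma omp for
            for (i = 0; i < n; i++)
                a[i] = gsl_rng_uniform(rng[j]);
        }
    }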