
I am new to OpenMP and I am trying to use OpenMP + SIMD intrinsics to speed up my program, but the result is far from my expectation.

In order to simplify the case without losing much essential information, I wrote a simpler toy example:

#include <omp.h>
#include <stdlib.h>
#include <iostream>
#include <vector>
#include <sys/time.h>

#include "immintrin.h" // for SIMD intrinsics

int main() {
    int64_t size = 160000000;
    std::vector<int> src(size);

    // generating random src data
    for (int i = 0; i < size; ++i)
        src[i] = (rand() / (float)RAND_MAX) * size;

    // to store the final results, so size is the same as src
    std::vector<int> dst(size);

    // get pointers for vector load and store
    int * src_ptr = src.data();
    int * dst_ptr = dst.data();

    __m256i vec_src;
    __m256i vec_op = _mm256_set1_epi32(2);
    __m256i vec_dst;

    omp_set_num_threads(4); // you can change thread count here

    // only measure the parallel part
    struct timeval one, two;
    double get_time;
    gettimeofday (&one, NULL);

    #pragma omp parallel for private(vec_src, vec_op, vec_dst)
    for (int64_t i = 0; i < size; i += 8) {
        // load needed data
        vec_src = _mm256_loadu_si256((__m256i const *)(src_ptr + i));

        // computation part
        vec_dst = _mm256_add_epi32(vec_src, vec_op);
        vec_dst = _mm256_mullo_epi32(vec_dst, vec_src);
        vec_dst = _mm256_slli_epi32(vec_dst, 1);
        vec_dst = _mm256_add_epi32(vec_dst, vec_src);
        vec_dst = _mm256_sub_epi32(vec_dst, vec_src);

        // store results
        _mm256_storeu_si256((__m256i *)(dst_ptr + i), vec_dst);
    }

    gettimeofday(&two, NULL);
    double oneD = one.tv_sec + (double)one.tv_usec * .000001;
    double twoD = two.tv_sec + (double)two.tv_usec * .000001;
    get_time = 1000 * (twoD - oneD);
    std::cout << "took time: " << get_time << std::endl;

    // output a random element in case the computation is optimized out
    int64_t idx = (int64_t)((rand() / (float)RAND_MAX) * size);
    std::cout << idx << ": " << dst[idx] << std::endl;

    return 0;
}

It is compiled with `icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp test.cpp -o test` and the elapsed time of the parallel part is measured in milliseconds. The result is as follows (the median of 5 runs is reported for each):

1 thread: 92.519 ms

2 threads: 89.045 ms

4 threads: 90.361 ms

The computations seem embarrassingly parallel, since different threads can load the data they need simultaneously from different indices, and the same holds for writing the results, so why is there no speedup?

More information:

  1. I checked the assembly code using `icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp -S test.cpp` and found that vectorized instructions are generated;

  2. To check whether it is memory-bound, I commented out the computation part of the loop, and the measured time decreased to around 60 ms, but it does not change much as I increase the thread count from 1 -> 2 -> 4.

Any advice or clue is welcome.

EDIT-1:

Thanks to @JerryCoffin for pointing out the possible cause; I did a Memory Access Analysis using VTune. Here are the results:

1 thread: Memory Bound: 6.5%, L1 Bound: 0.134, L3 Latency: 0.039

2 threads: Memory Bound: 18.0%, L1 Bound: 0.115, L3 Latency: 0.015

4 threads: Memory Bound: 21.6%, L1 Bound: 0.213, L3 Latency: 0.003

It is an Intel i7-4770 processor with 25.6 GB/s (23 GB/s measured by VTune) maximum bandwidth. The Memory Bound metric does increase, but I am still not sure whether that is the cause. Any advice?

EDIT-2 (just trying to give thorough information, so this part may be long, but hopefully not tedious):

Thanks for the suggestions from @PaulR and @bazza. I tried three approaches for comparison. One thing to note is that the processor has 4 cores and 8 hardware threads. Here are the results:

(1) just initialize dst to all zeros in advance: 1 thread: 91.922; 2 threads: 93.170; 4 threads: 93.868 --- does not seem effective;

(2) without (1), put the parallel part in an outer loop over 100 iterations, and measure the time of the 100 iterations: 1 thread: 9109.49; 2 threads: 4951.20; 4 threads: 2511.01; 8 threads: 2861.75 --- quite effective except for 8 threads;

(3) based on (2), run one extra (warm-up) iteration before the 100 iterations, and measure the time of the 100 iterations: 1 thread: 9078.02; 2 threads: 4956.66; 4 threads: 2516.93; 8 threads: 2088.88 --- similar to (2) but more effective for 8 threads.

It seems more iterations can expose the advantages of OpenMP + SIMD, but the computation / memory access ratio is unchanged regardless of the loop count, and locality does not seem to be the reason either, since src and dst are far too large to stay in any cache, so there is no data reuse between consecutive iterations.

Any advice?

EDIT 3:

To avoid any confusion, one thing needs to be clarified: in (2) and (3), the OpenMP directive is outside the added outer loop:

#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int k = 0; k < 100; ++k) {
    for (int64_t i = 0; i < size; i += 8) {
        ......
    }
}

i.e. the outer loop is parallelized across threads, and the inner loop is still processed serially by each thread. So the effective speedup in (2) and (3) might be achieved by enhanced locality among threads.

I did another experiment in which the OpenMP directive is put inside the outer loop:

for (int k = 0; k < 100; ++k) {
    #pragma omp parallel for private(vec_src, vec_op, vec_dst)
    for (int64_t i = 0; i < size; i += 8) {
        ......
    }
}

and the speedup is still not good: 1 thread: 9074.18; 2 threads: 8809.36; 4 threads: 8936.89; 8 threads: 9098.83.

Problem still exists. :(

EDIT-4:

If I replace the vectorized part with scalar operations like this (the same calculations, but done in a scalar way):

#pragma omp parallel for
for (int64_t i = 0; i < size; i++) { // not i += 8
    int query = src[i];
    int res = src[i] + 2;
    res = res * query;
    res = res << 1;
    res = res + query;
    res = res - query;
    dst[i] = res;
}

The speedup is: 1 thread: 92.065; 2 threads: 89.432; 4 threads: 88.864. May I come to the conclusion that this seemingly embarrassingly parallel problem is actually memory bound (the bottleneck is load / store operations)? If so, why can't the load / store operations be parallelized well?

MarZzz
  • I'd start by looking at the bandwidth you're using to main memory. At least at first glance, this looks like there's a good chance it's memory bound even with a single core--in which case adding more cores isn't going to gain much. – Jerry Coffin Mar 20 '17 at 04:56
  • @JerryCoffin Thanks for the possible clue. Do you mean that most of the time is spent on loads/stores instead of computing? Originally I thought the memory access locality was good because they are consecutive reads/writes for each thread. I will profile the memory accesses using VTune and update later. If that is the problem, do you have any advice on resolving it? – MarZzz Mar 20 '17 at 06:27
  • Your timing may be skewed by page faults on `dst` (due to lazy allocation) - either initialise `dst` explicitly, and/or run your benchmark inside an outer loop over 2 or more iterations and ignore the timing for the first iteration. – Paul R Mar 20 '17 at 07:10
  • Your entire data set, src and dst combined, is 1.2 GByte. The processing that you do on each item in src is quite lightweight, so it feels like it ought to be memory bound. However, reading / writing the whole 1.2 GBytes to SDRAM should take no more than 51 milliseconds, so there is some other overhead at work here. It's well worth trying out the suggestion from @PaulR – bazza Mar 20 '17 at 07:36
  • Another thing to look at: your loop only takes around 90 ms currently - so the granularity for any kind of parallel processing with threads/OMP/whatever may be too small to yield any significant benefit - try increasing `size` by an order of magnitude (or add an outer loop) ? – Paul R Mar 20 '17 at 08:30
  • could this be due to cache misalignment, because you are preparing src data using scalar and using them as vec – Isuru H Mar 20 '17 at 11:21
  • @IsuruH Thanks for your advice, but I also tried a pure scalar version parallelized using OpenMP, and the speedup is still poor as the thread count increases. The scalar version performs the same calculations as the above vector version. Any other possible cause? – MarZzz Mar 20 '17 at 11:34
  • did you mean your parallelized scalar version did not scale linearly? As this is embarrassingly parallel, I would expect a linear speedup. – Isuru H Mar 20 '17 at 11:38
  • @IsuruH Please refer to EDIT-4. – MarZzz Mar 20 '17 at 12:08
  • I think the issue might be related to data for each thread not being aligned to cache lines. Thread 1 will read the 0th, 8th, 16th entries and thread 2 will read the 1st, 9th, 17th, etc. This will make coherency kick in and slow down the whole thing. Instead of a vector, try to allocate a 2D array, so that the memory locations required by each thread are adjacent. Give that a go, hopefully you'll see a better speedup curve. – Isuru H Mar 20 '17 at 14:18
  • @IsuruH definitely not. OpenMP implementations do not use such bad tiny chunk sizes by default. Performance would be **way** worse if that was the case, time for >1 thread would drastically increase. – Zulan Mar 20 '17 at 16:59
  • @IsuruH I made a copy / paste mistake that in the scalar version of EDIT-4, it should be `i++` instead of `i += 8`. Sorry for misleading. – MarZzz Mar 21 '17 at 00:00

1 Answer


May I come to the conclusion that this seemingly embarrassingly parallel problem is actually memory bound (the bottleneck is load / store operations)? If so, why can't the load / store operations be parallelized well?

Yes, this problem is embarrassingly parallel in the sense that it is easy to parallelize due to the lack of dependencies. That doesn't imply that it will scale perfectly. You can still have a bad initialization-overhead-to-work ratio or shared resources limiting your speedup.

In your case, you are indeed limited by memory bandwidth. A practical consideration first: when compiled with icpc (16.0.3 or 17.0.1), the "scalar" version yields better code when `size` is made `constexpr`. This is not because it optimizes away these two redundant lines:

res = res + query;
res = res - query;

It does, but that makes no difference. Mainly, the compiler uses exactly the same instructions that you do with the intrinsics, except for the store. For the store, it uses vmovntdq instead of vmovdqu, making use of its knowledge about the program, the memory, and the architecture. Not only does vmovntdq require aligned memory (and can therefore be more efficient); it also gives the CPU a non-temporal hint, preventing the data from being cached during the write to memory. This improves performance, because writing the data to the cache would require loading the remainder of the cache line from memory. So while your initial SIMD version requires three memory operations (reading the source, reading the destination cache line, writing the destination), the compiler version with the non-temporal store requires only two. In fact, on my i7-4770 system, the compiler-generated version reduces the runtime at 2 threads from ~85.8 ms to 58.0 ms, an almost perfect 1.5x speedup. The lesson here is to trust your compiler unless you know the architecture and instruction set extremely well.
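
For reference, the same non-temporal store is also available as an intrinsic, so you can get this effect without relying on the compiler. The following is only a sketch of the loop from the question, not the compiler's actual output, and it assumes dst_ptr points to 32-byte-aligned memory (a plain std::vector makes no such guarantee, so an aligned allocation such as _mm_malloc(size * sizeof(int), 32) would be needed in practice):

// Sketch: same kernel, but with a streaming (non-temporal) store.
// _mm256_stream_si256 compiles to vmovntdq and requires a 32-byte-aligned address.
#pragma omp parallel for
for (int64_t i = 0; i < size; i += 8) {
    __m256i vec_src = _mm256_loadu_si256((__m256i const *)(src_ptr + i));
    __m256i vec_dst = _mm256_add_epi32(vec_src, _mm256_set1_epi32(2));
    vec_dst = _mm256_mullo_epi32(vec_dst, vec_src);
    vec_dst = _mm256_slli_epi32(vec_dst, 1);
    // the redundant + src[i] / - src[i] steps are omitted, as the compiler drops them anyway
    _mm256_stream_si256((__m256i *)(dst_ptr + i), vec_dst); // write bypasses the cache
}
_mm_sfence(); // make the streaming stores visible before stopping the timer or reading dst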

Considering peak performance here: 58 ms for transferring 2 * 160,000,000 * 4 bytes = 1.28 GB corresponds to 22.07 GB/s (summing read and write), which is about the same as your VTune results. (Funnily enough, 85.8 ms for two reads and one write corresponds to about the same bandwidth.) There isn't much more direct room for improvement.

To further improve performance, you would have to do something about the operation / byte ratio of your code. Remember that your processor can perform 217.6 GFLOP/s (I guess the same or twice that for integer operations), but it can only read & write 3.2 G ints/s (25.6 GB/s divided by the 4 bytes read plus 4 bytes written per element). That gives you an idea of how many operations you need to perform per element in order not to be limited by memory. So if you can, work on the data in blocks so that you can reuse data in the caches.
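
To make that blocking suggestion concrete, here is a minimal sketch under the assumption that the kernel really has to be applied several times (as in the 100 passes of EDIT-2); with a single pass every element is touched exactly once and there is nothing to reuse. BLOCK and NITER are hypothetical tuning parameters, sized here so that one block of src plus the corresponding block of dst fits comfortably into a 256 KB L2:

// Sketch of cache blocking: run all NITER passes over one small block before
// moving on to the next block, so passes 2..NITER hit the cache instead of DRAM.
const int64_t BLOCK = 16384;   // 16384 ints * 4 B * 2 arrays = 128 KB per block (tunable)
const int NITER = 100;         // number of passes over the data, as in EDIT-2
#pragma omp parallel for
for (int64_t b = 0; b < size; b += BLOCK) {
    int64_t end = (b + BLOCK < size) ? (b + BLOCK) : size;
    for (int n = 0; n < NITER; ++n)          // reuse the block while it is cached
        for (int64_t i = b; i < end; ++i) {
            int query = src[i];
            dst[i] = ((query + 2) * query) << 1;   // same arithmetic, redundant +/- removed
        }
}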

I cannot reproduce your results for (2) and (3). When I loop around the inner loop, the scaling behaves the same. The results look fishy, particularly in light of the results otherwise being so consistent with peak performance. Generally, I recommend doing the measurement inside the parallel region and using omp_get_wtime, like this:

  double one, two;
#pragma omp parallel 
  {
    __m256i vec_src;
    __m256i vec_op = _mm256_set1_epi32(2);   
    __m256i vec_dst;

#pragma omp master
    one = omp_get_wtime();
#pragma omp barrier
    for (int kk = 0; kk < 100; kk++) {
#pragma omp for
      for (int64_t i = 0; i < size; i += 8) {
          ...
      }
    }
#pragma omp master
    {
      two = omp_get_wtime();
      std::cout << "took time: " << (two-one) * 1000 << std::endl;
    }
  }

A final remark: desktop processors and server processors have very different characteristics regarding memory performance. On contemporary server processors, you need many more active threads to saturate the memory bandwidth, while on desktop processors a single core can often almost saturate it.
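
If you want to see where your own machine saturates, a small probe like the sketch below (not part of the original code) can help: it times a plain copy and reports the achieved bandwidth, so you can run it with OMP_NUM_THREADS set to 1, 2, 4, ... and watch where the curve flattens.

// Minimal bandwidth probe: copy src to dst and report GB/s for the current thread count.
#include <omp.h>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const int64_t n = 160000000;
    std::vector<int> src(n, 1), dst(n);       // value-initialized, so the pages are already touched

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int64_t i = 0; i < n; ++i)
        dst[i] = src[i];                      // pure read + write, no arithmetic
    double t1 = omp_get_wtime();

    double gbytes = 2.0 * n * sizeof(int) / 1e9;   // bytes read plus bytes written
    std::cout << gbytes / (t1 - t0) << " GB/s with "
              << omp_get_max_threads() << " threads" << std::endl;
    return 0;
}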

Edit: One more thought about VTune not classifying it as memory-bound: this may be caused by the short computation time compared to the initialization. Try to see what VTune says about the code in a loop.

Zulan
  • 1. If I understood correctly, is the compiler-generated version you mentioned the scalar implementation in EDIT-4? I also compiled it using `icpc` with `-O3`, but the speedup is not obvious; maybe because you use `17.0.1` while I use `16.0.3`? – MarZzz Mar 21 '17 at 05:59
  • 2. I think the good scaling in (2) and (3) makes sense, as different threads almost deal with the same data at the same time, so once the needed data is fetched into the shared cache (the L3 cache is shared) by one thread, the other threads won't bother accessing main memory. Do you think that is a possible reason? But that is not the case if we change the position of the directive as described in EDIT-3. – MarZzz Mar 21 '17 at 06:00
  • Interesting point about server and desktop processors and multiple threads to get maximum memory bandwidth. [I observed this a while back](https://stackoverflow.com/questions/25179738/measuring-memory-bandwidth-from-the-dot-product-of-two-arrays). I thought that was on a desktop processor but it turns out it was a single Xeon processor. Could you explain why a single-socket server processor needs multiple threads to get maximum memory bandwidth but a desktop processor does not? I thought the main difference with the Xeon processors was a larger L3 cache. – Z boson Mar 21 '17 at 08:10
  • Note that since Ivy Bridge, for memory copy and set operations you can do better than non-temporal stores [using "enhanced rep movsb"](https://stackoverflow.com/questions/26246040/whats-missing-sub-optimal-in-this-memcpy-implementation/26256216#26256216). – Z boson Mar 21 '17 at 08:15
  • Do GCC or Clang generate non-temporal stores? Last time I looked into this they did not. It's something you have to do by hand. OTOH, the rule of thumb according to Agner Fog is to use non-temporal stores when the memory size is larger than half the last level cache (LLC). That's clearly the case in the OP's code, but how can a compiler optimize for this in general? If the memory size is much less than half the LLC, non-temporal stores are going to give worse performance. It would be interesting to see how ICC works this out. – Z boson Mar 21 '17 at 08:18
  • @MarZzz / @z-boson I'm afraid I unintentionally cheated. I made `size` `constexpr`; this seems to help the compiler use non-temporal stores - otherwise it produces exactly the same code as the intrinsics, with `vmovdqu`. – Zulan Mar 21 '17 at 11:54
  • @MarZzz I don't see how the outer loop should help caching as each outer iteration processes the entire >> L3 data set, right? Can you share the specific code you used? – Zulan Mar 21 '17 at 11:55
  • @Zboson Server processors generally have more memory channels, more cores, but less frequency per core. For example the i7-4770 supports dual channel DDR3 1600 (25.6 GB/s) whereas an E5-2690 v3 supports quad channel DDR4 2133 (68 GB/s). But frequencies are 3.4 GHz vs 2.6 GHz. – Zulan Mar 21 '17 at 12:05
  • @Zboson I can't seem to get GCC to produce non-temporal stores. Do you have an idea how to convince any compiler to use the "enhanced rep movsb"? Or could you suggest an intrinsic that works here? Also may I kindly invite you to take a look [at this question](http://stackoverflow.com/q/42558907/620382). – Zulan Mar 21 '17 at 12:12
  • @Zulan, I have only read about "enhanced rep movsb". I don't think it applies to the OP's question anyway because it only applies to memcpy/memset. I'm not surprised GCC does not produce non-temporal stores. Did you check Clang? – Z boson Mar 21 '17 at 12:21
  • @Zboson my clang 3.8 isn't friends with OpenMP, but also does not produce non-temporal stores on a serial version. – Zulan Mar 21 '17 at 12:28
  • @Zulan Just a guess: when the directive is outside the outer loop, each thread processes the entire (and the same) `src`, thus loads may be well balanced between threads and they might process the same data at the same / similar time, so the data items processed have good locality among threads. E.g. when `Thread 0` processes data `i` in `src`, it is fetched from memory into the cache, and when `Thread 1` needs `i`, it is already in the cache. This is just my guess and I did not profile it to confirm. This code is the **first** version of EDIT-3. May I have your opinion? – MarZzz Mar 22 '17 at 00:44
  • @Zulan But when the directive is inside the outer loop, each thread simultaneously processes a different part of the data, so the scaling is different from the above-mentioned version. This is the **second** version of EDIT-3. – MarZzz Mar 22 '17 at 00:47
  • @MarZzz your reasoning is correct. I hadn't realized that you put the `parallel` on the outer loop - that leads to data races. So it might be faster, but it is wrong. – Zulan Mar 22 '17 at 19:44