SIMD alignment issue with PPL Combinable

Question

I'm trying to sum the elements of array in parallel with SIMD. To avoid locking I'm using combinable thread local which is not always aligned on 16 bytes because of that _mm_add_epi32 is throwing exception

concurrency::combinable<__m128i> sum_combine;

int length = 40; // multiple of 8
concurrency::parallel_for(0, length , 8, [&](int it)
{

    __m128i v1 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it));
    __m128i v2 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it + sizeof(uint32_t)));

    auto temp = _mm_add_epi32(v1, v2);

    auto &sum = sum_combine.local();   // here is the problem 


    TRACE(L"%d\n", it);
    TRACE(L"add %x\n", &sum);

    ASSERT(((unsigned long)&sum & 15) == 0);

    sum = _mm_add_epi32(temp, sum);
}
);

here is defination of combinable from ppl.h

template<typename _Ty>
class combinable
{
private:

// Disable warning C4324: structure was padded due to __declspec(align())
// This padding is expected and necessary.
#pragma warning(push)
#pragma warning(disable: 4324)
    __declspec(align(64))
    struct _Node
    {
        unsigned long _M_key;
        _Ty _M_value;                   // this might not be aligned on 16 bytes
        _Node* _M_chain;

        _Node(unsigned long _Key, _Ty _InitialValue)
            : _M_key(_Key), _M_value(_InitialValue), _M_chain(NULL)
        {
        }
    };

sometimes alignment is ok and the code works fine, but most of time its not working

I have tried to used the following, but this doesn't compile

union combine 
{
        unsigned short x[sizeof(__m128i) / sizeof(unsigned int)];
        __m128i y;
};

concurrency::combinable<combine> sum_combine;
then auto &sum = sum_combine.local().y;

Any suggestions for correcting the alignment issue, still using combinable.

On x64 it works fine bcause of default 16 bytes alignment. On x86 sometimes alignment problems exists.

Have you tried using the `#pragma pack(push, 1)` and `#pragma pack(pop)` directives in your struct definition? The compiler can pad your structure with 0's unless you explicitly tell it not to. — Mike, Jul 12 '15 at 14:51
Is an unaligned load OK? You may lose some performance particularly on older processors (eg Core2) — harold, Jul 12 '15 at 15:07
@harold but input_arr is already aligned, how will it effect sum alignment, is there _mm_addu_epi32 — vito, Jul 12 '15 at 15:14
OK - but `sizeof(uint32_t)` is completely inappropriate in this context - you're adding 4 `uint32_t` *elements*, not 4 bytes - it's just a coincidence that they have the same value. If anything it should be `sizeof(__m128i) / sizeof(uint32_t)`. Anyway, this is not the immediate problem, but bear it in mind, as it will be confusing to anyone else reading the code (including yourself in 6 months' time!). — Paul R, Jul 12 '15 at 15:34

vito · Accepted Answer · 2015-07-12T15:58:12.750

1

Just loaded sum using unaligned load

auto &sum = sum_combine.local();


#if !defined(_M_X64) 

if (((unsigned long)&sum & 15) != 0)
{
    // just for breakpoint means, sum  is unaligned.
    int a = 5;
}
auto sum_temp = _mm_loadu_si128(&sum);
sum = _mm_add_epi32(temp, sum_temp);

#else

sum = _mm_add_epi32(temp, sum);

#endif

edited Jul 12 '15 at 15:58

answered Jul 12 '15 at 15:48

vito

317
4
15

Paul R · Answer 2 · 2015-07-12T15:29:23.320

0

Since the sum variable being used with _mm_add_epi32 is not aligned you need to explicitly load/store sum using unaligned loads/stores (_mm_loadu_si128/_mm_storeu_si128). Change:

sum = _mm_add_epi32(temp, sum);

to:

__m128i v2 = _mm_loadu_si128((__m128i *)&sum);
v2 = _mm_add_epi32(v2, temp);
_mm_storeu_si128((__m128i *)&sum, v2);

edited Jul 12 '15 at 15:29

answered Jul 12 '15 at 15:16

Paul R

195,989
32
353
519

still same behavior 50 % times exception. i tried__m128i v2 using _mm_loadu_si128. but what if sum.sum_combine() is unaligned – vito Jul 12 '15 at 15:25
Oh, right - it's the *second* `_mm_add_epi32` that's crashing - in that case you need to use explicit unaligned loads. Let me update my answer... – Paul R Jul 12 '15 at 15:27
@Pual R thanks for answer and suggestion. i did it a bit differently. Now its working auto &sum = sum_combine.local(); auto sum_temp = _mm_loadu_si128(&sum); if(((unsigned long)&sum & 15) != 0) { // just for breakpoint this is unaligned. int a = 5; } sum = _mm_add_epi32(temp, sum_temp ); – vito Jul 12 '15 at 15:41

SIMD alignment issue with PPL Combinable

2 Answers2

Linked