Isn't __m128d aligned natively?

Question

I've this code:

double a[bufferSize];
double b[voiceSize][bufferSize];
double c[voiceSize][bufferSize];

...

inline void AddIntrinsics(int voiceIndex, int blockSize) {
    // assuming blockSize / 2 == 0 and voiceIndex is within the range
    int iters = blockSize / 2;
    __m128d *pA = (__m128d*)a;
    __m128d *pB = (__m128d*)b[voiceIndex];
    double *pC = c[voiceIndex];

    for (int i = 0; i < iters; i++, pA++, pB++, pC += 2) {
        _mm_store_pd(pC, _mm_add_pd(*pA, *pB));
    }   
}

But "sometimes" it raise Access memory violation, which I think its due to the lacks of memory alignment of my 3 arrays a, b and c.

But since I operate on __m128d (which use __declspec(align(16))), isn't the alignment guaranteed when I cast to those pointer?

Or since it would use __m128d as "register", it could mov directly on register from an unaligned memory (hence, the exception)?

If so, how would you align arrays in C++ for this kind of stuff? std::align?

I'm on Win x64, MSVC, Compiling in Release mode 32 and 64 bit.

Platform and compiler? And why aren't you checking the range of voiceIndex? — bvj, Dec 14 '18 at 08:23
@Bathsheba: Dereferencing a `__m128d*` actually *is* fine and portable according to Intel. But only if it's pointing to aligned data. GCC defines it with `__attribute__((may_alias))` because it *can* read/write objects of other types the same way that `char*` can. See my edit on [What is \_\_m128d?](https://stackoverflow.com/q/53757633). — Peter Cordes, Dec 14 '18 at 08:45
@Bathsheba falling down to inline asm isn't good advice here. One of the points of these intrinsics types is so you can write your logic, but still let the compiler do things like register allocation, DCE, register spilling, loop-unrolling, etc, which humans are bad at but compilers are very good at. That means the humans can write the kernel of the logic using SIMD types & operations without worrying about all the fluff surrounding it. Few compilers can optimize inline asm, and those that do dont do it very well. — Mike Vine, Dec 14 '18 at 09:55
@MikeVine: Apologies, I had meant to retract that comment now we have a good answer. — Bathsheba, Dec 14 '18 at 09:56

Peter Cordes · Accepted Answer · 2018-12-14T09:04:10.023

8

__m128d is a type that assumes / requires / guarantees (to the compiler) 16-byte alignment¹.

Casting a misaligned pointer to __m128d* and dereferencing it is undefined behaviour, and this is the expected result. Use _mm_loadu_pd if your data might not be aligned. (Or preferably, align your data with alignas(16) double a[bufferSize]; ²). ISO C++11 and later have portable syntax for aligning static and automatic storage (but not as easy for dynamic storage).

Casting a pointer to __m128d* and dereferencing it is like promising the compiler that it is aligned. C++ lets you lie to the compiler, with potentially disastrous results. Doing an alignment-required operation doesn't retroactively align your data; that wouldn't make sense or even be possible when you compile multiple files separately or when you operate through pointers.

Footnote 1: Fun fact: GCC's implementation of Intel's intrinsics API adds a __m128d_u type: unaligned vectors that imply 1-byte alignment if you dereference a pointer.

typedef double __m128d_u 
       __attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));

Don't use in portable code; I don't think MSVC supports this, and Intel doesn't define it.

Footnote 2: In your case, you also need every row of your 2D arrays to be aligned by 16. So you need the array dimension to be [voiceSize][round_up_to_next_power_of_2(bufferSize)] if bufferSize can be odd. Leaving unused padding element(s) at the end of every row is a common technique, e.g. in graphics programming for 2d images with potentially-odd widths.

BTW, this is not "special" or specific to intrinsics: casting a void* or char* to int* (and dereferencing it) is only safe if its sufficiently aligned. In x86-64 System V and Windows x64, alignof(int) = 4.

(Fun fact: even creating a misaligned pointer is undefined behaviour in ISO C++. But compilers that support Intel's intrinsics API must support stuff like _mm_loadu_si128( (__m128i*)char_ptr ), so we can consider creating without dereference of unaligned pointers as part of the extension.)

It usually happens to work on x86 because only 16-byte loads have an alignment-required version. But on SPARC for example, you'd potentially have the same problem. It is possible to run into trouble with misaligned pointers to int or short even on x86, though. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? is a good example: auto-vectorization by gcc assumes that some whole number of uint16_t elements will reach a 16-byte alignment boundary.

It's also easier to run into problems with intrinsics because alignof(__m128d) is greater than the alignment of most primitive types. On 32-bit x86 C++ implementations, alignof(maxalign_t) is only 8, so malloc and new typically only return 8-byte aligned memory.

edited Dec 14 '18 at 09:04

answered Dec 14 '18 at 08:38

Peter Cordes

245,674
35
423
606

@Peter Cordes so basically, for my scenario, do `alignas(16)` before any of my arrays should ensure they are aligned, and further access to them (even via pointer) should be ok, right? – markzzz Dec 14 '18 at 08:48
@markzzz: yes, for static arrays just declare them all as aligned so you can safely promise alignment to the compiler. (This is slightly more efficient than forcing the compiler to make code that works even without alignment, especially if you compile without AVX. And there are runtime penalties for loads/stores that split across the boundary between two cache-lines, even though that's minor on modern CPUs.) – Peter Cordes Dec 14 '18 at 08:52
@Peter Cordes: thanks! Also: any differences in pointer+deference (once, than access by index) or use _mm_load every time I need to access data from array to `__m128d`? Or should I open a new question? – markzzz Dec 14 '18 at 08:57
@markzzz: there's no difference. It's usually good style to use `_mm_load_pd` and `_mm_store_pd`, especially for float/double because they do the casting for you. And it makes your C++ source look more like asm. (In terms of C++ language rules and compiler behaviour, it's totally not like writing asm, but it visually looks like it to humans. I think that was part of Intel's design goal with including load/store intrinsics. In C++ language terms, they only exist to communicate alignment guarantees or lack thereof to the compiler. Only `loadu` is different from a deref.) – Peter Cordes Dec 14 '18 at 09:07
@PeterCordes: well, basically, using pointer introduce a new variable (i.e. the pointer). Isn't this "using" memory on registers that could be used only for vector data? – markzzz Dec 14 '18 at 09:16
2

@markzzz: no, integer registers are separate anyway, and the compiler can use the pointer *instead* of the actual loop counter. i.e. it can optimize away `i`. (You can help it do this by writing `const __m128d *endA = (const __m128d*)(a + bufferSize)`, and using `pA < endA` as your loop condition. But the compiler is *allowed* to make that optimization on its own, as per the as-if rule.) – Peter Cordes Dec 14 '18 at 09:25
What do you mean with "integer registers are separate anyway"? They are all XMMs, aren't them? – markzzz Dec 14 '18 at 09:32
2

@markzzz: *scalar* integer registers, like `rdi`, (aka "general purpose" registers) are used to hold pointers and integers. (In asm, a pointer is just an integer). A `__m128d*` is still just a pointer. So you'll have instructions like `add rdi, 16` / `movapd xmm0, [rdi]` / `addpd xmm0, [rsi]`. Look at the asm code-gen for your loop on https://godbolt.org/ (with optimization enabled). – Peter Cordes Dec 14 '18 at 09:37
Oh, nice! I see! So they go into seperate spaces! That's why when I see optimized code, they "loop unroll" and use 8*128bit space, totally (since other variables are outside that register scope). NIce :) Another tip learned! You are awesome, you know? – markzzz Dec 14 '18 at 09:45

Isn't __m128d aligned natively?

1 Answers1