__m128d
is a type that assumes / requires / guarantees (to the compiler) 16-byte alignment1.
Casting a misaligned pointer to __m128d*
and dereferencing it is undefined behaviour, and this is the expected result. Use _mm_loadu_pd
if your data might not be aligned. (Or preferably, align your data with alignas(16) double a[bufferSize];
2). ISO C++11 and later have portable syntax for aligning static and automatic storage (but not as easy for dynamic storage).
Casting a pointer to __m128d*
and dereferencing it is like promising the compiler that it is aligned. C++ lets you lie to the compiler, with potentially disastrous results. Doing an alignment-required operation doesn't retroactively align your data; that wouldn't make sense or even be possible when you compile multiple files separately or when you operate through pointers.
Footnote 1: Fun fact: GCC's implementation of Intel's intrinsics API adds a __m128d_u
type: unaligned vectors that imply 1-byte alignment if you dereference a pointer.
typedef double __m128d_u
__attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));
Don't use in portable code; I don't think MSVC supports this, and Intel doesn't define it.
Footnote 2: In your case, you also need every row of your 2D arrays to be aligned by 16. So you need the array dimension to be [voiceSize][round_up_to_next_power_of_2(bufferSize)]
if bufferSize
can be odd. Leaving unused padding element(s) at the end of every row is a common technique, e.g. in graphics programming for 2d images with potentially-odd widths.
BTW, this is not "special" or specific to intrinsics: casting a void*
or char*
to int*
(and dereferencing it) is only safe if its sufficiently aligned. In x86-64 System V and Windows x64, alignof(int) = 4
.
(Fun fact: even creating a misaligned pointer is undefined behaviour in ISO C++. But compilers that support Intel's intrinsics API must support stuff like _mm_loadu_si128( (__m128i*)char_ptr )
, so we can consider creating without dereference of unaligned pointers as part of the extension.)
It usually happens to work on x86 because only 16-byte loads have an alignment-required version. But on SPARC for example, you'd potentially have the same problem. It is possible to run into trouble with misaligned pointers to int
or short
even on x86, though. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? is a good example: auto-vectorization by gcc assumes that some whole number of uint16_t
elements will reach a 16-byte alignment boundary.
It's also easier to run into problems with intrinsics because alignof(__m128d)
is greater than the alignment of most primitive types. On 32-bit x86 C++ implementations, alignof(maxalign_t)
is only 8, so malloc
and new
typically only return 8-byte aligned memory.