
I want to load __m256 directly from Armadillo vector data with .memptr(). Does Armadillo ensure the data memory is 256-bits aligned? If it is then I would just convert the float/double pointer returned by .memptr() to __m256 pointer and skip the _mm256_load_ps(), if it makes sense in terms of performance.


1 Answer


The Armadillo documentation does not seem to address this point, so it is left unspecified. Thus, vector data are likely not guaranteed to be 32-byte aligned.

However, you do not need the vector data to be aligned in order to load it into AVX registers: you can use the unaligned load intrinsic `_mm256_loadu_ps`. AFAIK, the performance of `_mm256_load_ps` and `_mm256_loadu_ps` is about the same on relatively recent x86 processors.

Jérôme Richard
  • Thanks. I am curious what happens when I convert a `float *` to a `__m256*` with `reinterpret_cast` and then simply dereference the pointer — does the compiler automatically pick `_mm256_loadu_ps`? – Noob Mar 08 '21 at 17:24
  • Using a `reinterpret_cast` on data not aligned to `alignof(Type)` makes your program ill-formed in standard C++. Thus, loading 32 bytes of unaligned data through such a cast causes undefined behaviour. Actually, I advise you not to use a `reinterpret_cast` even when the data are aligned. You can find more information about that [here](https://stackoverflow.com/questions/52112605). – Jérôme Richard Mar 08 '21 at 17:43
  • @Noob: `alignof(__m256) == 32`, so deref is equivalent to `load`, not `loadu`. Fun fact: compilers like GNU C can implement `loadu` with a deref of an `__attribute__((aligned(1),may_alias,vector_size(32)))` type. [Is \`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605). For integer data you typically need some kind of cast (because Intel annoyingly defined the intrinsics to take `__m256i*` instead of `void*`), but yeah for float just use `_mm256_loadu_ps(const float*)`. – Peter Cordes Mar 09 '21 at 01:33
  • If your data happens to be 32-byte aligned at runtime, then great, otherwise the HW handles it; the penalty is non-existent if all 32 bytes come from the same cache line. But cache-line splits have extra latency for OoO exec to hide and worse throughput from doing 2 cache accesses. Page-splits are even worse especially on CPUs before Skylake, but of course happen rarely for sequential loops. – Peter Cordes Mar 09 '21 at 01:36