Load or shuffle a pair of floats with SIMD intrinsics for doubles?

Question

I'm writing some optimizations for single-precision floating-point calculations using SIMD intrinsics.

Sometimes a pd double-precision instruction does what I want more easily than any ps single precision one.

Example 1:

I have a pointer float* ptr which points to a block of floats: f0 f1 f2 f3 etc.

I want to load a __m256 value with [ f0, f1, f0, f1, f0, f1, f0, f1 ]. I didn't find a 64-bit broadcast for the __m256 data type. Can I use _mm256_broadcast_sd on floats?

float* ptr = ...; // pointer to some memory chunk aligned to 4 bytes
__m256 vat = _mm256_castpd_ps( _mm256_broadcast_sd( ( double* )ptr ) );

Example 2:

I have a __m256 value [f0, f1, f2, f3, f4, f5, f6, f7]. Can I use shift instructions like _mm256_srl_epi32, which take __m256i arguments, to manipulate my __m256 value?

I checked it in practice and it works, but is it correct to use instructions with different types this way?

score 4 · Accepted Answer · answered Apr 16 '21 at 08:49

Yes, vbroadcastsd is a good asm instruction for broadcasting a pair of floats, and _mm256_broadcast_sd + a cast intrinsic is a safe way to implement it in C.

Note that you aren't dereferencing (in pure C) a double* that points at float objects. You're only passing it to an intrinsic function. _mm256_set1_pd( *(double*)floatp ) would be strict aliasing undefined behaviour in C, but load/store intrinsics are defined to work regardless of what the pointer is actually pointing at. Exactly so you can easily do wide loads/stores to whatever data you actually have, not just __int64 or double.

For example, GCC's header defines _mm256_broadcast_sd(const double*) as a wrapper around __builtin_ia32_vbroadcastsd256. And GCC defines _mm_loadl_epi64 to include a dereference of *(__m64_u *)__P, where __m64_u is an unaligned may-alias version of __m64, which it defines as:

typedef int __m64_u __attribute__ ((__vector_size__ (8), __may_alias__, __aligned__ (1)));

(See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?)

In general, even load/store intrinsics that take a float* or double* (instead of __m128i*) are alignment and strict-aliasing safe. (Or at least I think they're supposed to be. On some compilers there might be some which aren't actually strict-aliasing safe, so it can be a pain to get them to safely emit vpbroadcastd from a pointer that isn't actually pointing at an int, for example. I forget which intrinsic it was that I found some compiler not respecting possible aliasing for.)
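As a minimal sketch of the broadcast from example 1: the helper name broadcast_float_pair and the store to an output array are hypothetical, added only so the result can be inspected. The target("avx") attribute is a GCC/Clang-specific assumption that lets the function build without -mavx; with other compilers you would enable AVX through compiler options instead.

```c
#include <immintrin.h>

/* Hypothetical helper: replicate the pair src[0], src[1] into all four
   64-bit lanes of a 256-bit vector, then store the eight floats to out. */
__attribute__((target("avx"))) /* assumption: GCC or Clang */
void broadcast_float_pair(const float *src, float out[8])
{
    /* The double* is only handed to the intrinsic; the C code itself never
       dereferences it, so there is no *(double*) access to float objects. */
    __m256d pair = _mm256_broadcast_sd((const double *)src);

    /* The cast intrinsic only changes the vector type; it emits no instruction. */
    __m256 v = _mm256_castpd_ps(pair);

    _mm256_storeu_ps(out, v); /* unaligned store, so out needs no special alignment */
}
```

Calling it with src pointing at f0, f1 should fill out with f0, f1, f0, f1, f0, f1, f0, f1.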

Your example 2 is not clear. Do you want to bit-shift the bit-patterns of floats? Yes, of course you can do that. That's why SIMD cast intrinsics exist: to keep the C compiler happy when you want to reinterpret the same bits as a different vector type.

It's common to do that as part of implementing exp() or log(), for example, such as in Fastest Implementation of Exponential Function Using AVX.
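As a sketch of this kind of reinterpretation, the following pulls the biased exponent field out of eight floats by shifting their bit patterns, which is a typical building block in log-style code. The helper name biased_exponents is hypothetical. It uses the immediate-count shift _mm256_srli_epi32 rather than the _mm256_srl_epi32 form from the question, and it needs AVX2 for the 256-bit integer shift. The target("avx2") attribute is again a GCC/Clang-specific assumption.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: write the 8-bit biased exponent of each float in
   in[0..7] to out[0..7], assuming IEEE 754 binary32 floats. */
__attribute__((target("avx2"))) /* assumption: GCC or Clang */
void biased_exponents(const float in[8], int32_t out[8])
{
    __m256 v = _mm256_loadu_ps(in);

    /* Reinterpret the same 256 bits as eight 32-bit integers (no instruction). */
    __m256i bits = _mm256_castps_si256(v);

    /* Shift out the 23 mantissa bits, then mask off the sign bit. */
    __m256i e = _mm256_srli_epi32(bits, 23);
    e = _mm256_and_si256(e, _mm256_set1_epi32(0xFF));

    _mm256_storeu_si256((__m256i *)out, e);
}
```

For an input of 1.0f the stored value is the exponent bias, 127. To go back to floats after integer manipulation, _mm256_castsi256_ps performs the reverse reinterpretation.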
