Naturally it's going to depend a lot on your code, but I've implemented two simple functions using both approaches (the intrinsics come from xmmintrin.h). Here's the code
__m128 calc_set1(float num1, float num2)
{
__m128 num1_4 = _mm_set1_ps(num1);
__m128 num2_4 = _mm_set1_ps(num2);
__m128 result4 = _mm_mul_ps(num1_4, num2_4);
return result4;
}
__m128 calc_mov(float* num1_4_addr, float* num2_4_addr)
{
__m128 num1_4 = _mm_load_ps(num1_4_addr);
__m128 num2_4 = _mm_load_ps(num2_4_addr);
__m128 result4 = _mm_mul_ps(num1_4, num2_4);
return result4;
}
and assembly
calc_set1(float, float):
shufps $0, %xmm0, %xmm0
shufps $0, %xmm1, %xmm1
mulps %xmm1, %xmm0
ret
calc_mov(float*, float*):
movaps (%rdi), %xmm0
mulps (%rsi), %xmm0
ret
You can see that calc_mov()
does exactly what you'd expect, while calc_set1()
broadcasts each argument with a single shufps instruction.
A movaps
load can take roughly four cycles of load-use latency, plus more if the L1 cache's load ports are busy, plus considerably more in the rare event of a cache miss.
shufps
has a single-cycle latency on recent Intel microarchitectures. I believe this holds for both 128-bit SSE and 256-bit AVX operands. Therefore I would suggest using the _mm_set1_ps
approach.
Of course, a shuffle instruction assumes the float is already in an SSE/AVX register. If you're loading it from memory, then a broadcast (vbroadcastss) is better, since it combines the load of movaps
and the splat of shufps
in a single instruction.