Naturally it's going to depend a lot on your code, but I've implemented two simple functions using both approaches (the intrinsics come from xmmintrin.h). Here's the code
__m128 calc_set1(float num1, float num2)
{
__m128 num1_4 = _mm_set1_ps(num1);
__m128 num2_4 = _mm_set1_ps(num2);
__m128 result4 = _mm_mul_ps(num1_4, num2_4);
return result4;
}
__m128 calc_mov(float* num1_4_addr, float* num2_4_addr)
{
__m128 num1_4 = _mm_load_ps(num1_4_addr);
__m128 num2_4 = _mm_load_ps(num2_4_addr);
__m128 result4 = _mm_mul_ps(num1_4, num2_4);
return result4;
}
and assembly
calc_set1(float, float):
shufps $0, %xmm0, %xmm0
shufps $0, %xmm1, %xmm1
mulps %xmm1, %xmm0
ret
calc_mov(float*, float*):
movaps (%rdi), %xmm0
mulps (%rsi), %xmm0
ret
You can see that calc_mov()
does exactly what you'd expect, while calc_set1()
broadcasts each argument with a single shufps instruction.
A movaps
load can take roughly four cycles of load-use latency, plus more if the L1 cache's load ports are busy, plus considerably more in the rare event of a cache miss.
shufps
has a single-cycle latency on recent Intel microarchitectures. I believe this holds for both 128-bit SSE and 256-bit AVX operands. Therefore I would suggest using the _mm_set1_ps
approach.
Of course, a shuffle instruction assumes the float is already in an SSE/AVX register. If you're loading it from memory, then a broadcast (vbroadcastss) is better, since it combines the load of movaps
and the splat of shufps
in a single instruction.