
I am trying to normalize a 4D vector.

My first approach was to use SSE intrinsics, which gave roughly a 2× speed boost to my vector arithmetic. Here is the basic code (v.v4 is the input, compiled with GCC, and all of this is inlined):

//find squares
v4sf s = __builtin_ia32_mulps(v.v4, v.v4);
//set t to square
v4sf t = s;
//add the 4 squares together
s = __builtin_ia32_shufps(s, s, 0x1B); // reverse the lanes
t = __builtin_ia32_addps(t, s);
s = __builtin_ia32_shufps(s, s, 0x4E); // swap the halves
t = __builtin_ia32_addps(t, s);
s = __builtin_ia32_shufps(s, s, 0x1B); // reverse again
t = __builtin_ia32_addps(t, s);
//find 1/sqrt of t (every lane now holds the full sum)
t = __builtin_ia32_rsqrtps(t);
//multiply to get normal
return Vec4(__builtin_ia32_mulps(v.v4, t));

I checked the disassembly and it looks like what I would expect; I don't see any obvious problems there.

Then I tried the well-known fast inverse square root approximation (found via Google):

float x = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float xhalf = 0.5f*x;
int i = *(int*)&x; // get bits for floating value
i = 0x5f3759df - (i>>1); // give initial guess y0
x = *(float*)&i; // convert bits back to float
x *= 1.5f - xhalf*x*x; // newton step, repeating this step
// increases accuracy
//x *= 1.5f - xhalf*x*x;
return Vec4(v.w*x, v.x*x, v.y*x, v.z*x);

It runs slightly faster than the SSE version (about 5-10% faster), and its results are very accurate as well - to within about 0.001 when computing the length! But GCC gives me that lame strict-aliasing warning because of the type punning.

So I modified it:

union {
    float fa;
    int ia;
};
fa = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float faHalf = 0.5f*fa;
ia = 0x5f3759df - (ia>>1);
fa *= 1.5f - faHalf*fa*fa;
//fa *= 1.5f - faHalf*fa*fa;
return Vec4(v.w*fa, v.x*fa, v.y*fa, v.z*fa);

And now the modified version (with no warnings) runs slower!! It runs at roughly 60% of the speed of the SSE version (with the same results)! Why is this?
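Incidentally, a `memcpy`-based pun avoids the warning without a union, and GCC compiles a fixed-size `memcpy` of a float down to a plain register move. A minimal sketch (the `rsqrt_approx` helper name is mine):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Fast inverse square root with the type pun done via std::memcpy,
// which is well-defined under the aliasing rules. GCC turns the
// fixed-size copies into register moves, so no store/reload occurs.
static float rsqrt_approx(float x) {
    float xhalf = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);   // bit-copy float -> int
    i = 0x5f3759df - (i >> 1);       // magic initial guess
    std::memcpy(&x, &i, sizeof x);   // bit-copy int -> float
    x *= 1.5f - xhalf * x * x;       // one Newton-Raphson step
    return x;
}
```

One Newton step gives roughly three decimal digits, which matches the ~0.001 accuracy observed above.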

So here are my questions:

  1. Is my SSE implementation correct?
  2. Is SSE really slower than plain FPU operations?
  3. Why the hell is the third version so much slower?
Pubby
  • It would help to know what CPU you are using. E.g. old x86 CPUs (pre Core 2) had very poor SSE capabilities. – Paul R Feb 01 '11 at 19:17
  • I am on an Intel Pentium Dual-Core. – Pubby Feb 01 '11 at 19:27
  • Duplicate of http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x ? – celion Feb 01 '11 at 19:27
  • You must have something wrong (most likely the shuffle/add). RSQRTPS is supposed to be correct to 11 bits. – Suma Feb 01 '11 at 19:30
  • @Pepe: older x86 CPUs only had 64-bit SSE execution units (128-bit operations are broken down into two 64-bit µops), so you typically won't see much speed gain for floating-point SIMD versus scalar code. Try your code on a newer CPU - e.g. Core 2 Duo, Core i5/i7, which have full 128-bit SSE units. – Paul R Feb 01 '11 at 19:33
  • @Suma: Yeah, I got confused and thought that shuf was rotate-through-carry. I fixed it and now it is accurate, but still not faster! I will try on another computer. – Pubby Feb 01 '11 at 19:59
  • Since it's C++, why not use reinterpret_cast instead of the union? – tauran Feb 01 '11 at 20:08
  • I think you should be able to sum all four components using two shuffles and two adds only: {1,2,3,4} shuffle to {2,2,4,4}, add to {1+2, x, 3+4, x}, shuffle to {3+4, x, 3+4, x}, add to {1+2+3+4, x, x, x}. Then you can use the scalar builtin_ia32_rsqrtss, shuffle it into all slots and finally multiply. – Suma Feb 01 '11 at 21:48
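Suma's two-shuffle reduction, sketched with the standard `<xmmintrin.h>` intrinsics (these map to the same instructions as the `__builtin_ia32_*` forms). This variant sums into every lane so a plain `rsqrtps` needs no broadcast; the `normalize4` name is mine:

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>

// Horizontal sum with two shuffles and two adds, then one rsqrtps.
static __m128 normalize4(__m128 v) {
    __m128 s = _mm_mul_ps(v, v);                  // { x2 y2 z2 w2 }
    // swap within each pair: adding gives { x2+y2, x2+y2, z2+w2, z2+w2 }
    s = _mm_add_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(2, 3, 0, 1)));
    // swap the two halves: adding leaves x2+y2+z2+w2 in every lane
    s = _mm_add_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_mul_ps(v, _mm_rsqrt_ps(s));        // v * 1/sqrt(|v|^2)
}
```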

3 Answers


I am a dope - I realized I had SETI@Home running while benchmarking. I'm guessing it was killing my SSE performance. Turned it off and got it running twice as fast.

I also tested it on an AMD Athlon and got the same results - SSE was faster.

At least I fixed the shuf bug!

Pubby

Here is the most efficient assembly code I can think of; you can compare this to what your compiler generates. Assume the input and output are in xmm0.

       ; start with xmm0 = { v.x v.y v.z v.w }
       movaps  %xmm0, %xmm4        ; save the original vector till the end
       mulps   %xmm0, %xmm0        ; xmm0 = { x2 y2 z2 w2 } (the squares)
       pshufd  $1, %xmm0, %xmm1    ; xmm1 = { y2 x2 x2 x2 }
       addss   %xmm0, %xmm1        ; xmm1 = { x2+y2 ... }
       pshufd  $3, %xmm0, %xmm2    ; xmm2 = { w2 x2 x2 x2 }
       movhlps %xmm0, %xmm3        ; xmm3 = { z2 w2 ? ? }
       addss   %xmm1, %xmm3        ; xmm3 = { x2+y2+z2 ... }
       addss   %xmm3, %xmm2        ; xmm2 = { x2+y2+z2+w2 ... }
       rsqrtss %xmm2, %xmm1        ; xmm1 = { rsqrt(x2+y2+z2+w2) ... }
       pshufd  $0, %xmm1, %xmm1    ; broadcast it to all four lanes
       mulps   %xmm4, %xmm1        ; scale the saved original vector
       movaps  %xmm1, %xmm0
       ; end with xmm0 = v * rsqrt(x2+y2+z2+w2), i.e. the normalized vector
zr.

My guess is that the third version is slower because the compiler decides to keep the union in a memory slot, so every pun forces a store and a reload. With the cast, it can copy the value from register to register. You can just look at the generated machine code.

As to why SSE would be inaccurate, I don't have an answer. It would help if you gave real numbers. If the difference is 0.3 on a vector of length 1, that would be outrageous.
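To put real numbers on the accuracy question, here is a quick sketch (assuming an x86 machine with SSE; the `max_rsqrt_rel_err` helper name is mine) that measures how far the hardware reciprocal square root strays from the exact value:

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>

// Measure the worst relative error of the hardware rsqrt over a sample
// range. Intel specifies a maximum relative error of 1.5 * 2^-12
// (about 0.00037) for RSQRTPS/RSQRTSS.
static float max_rsqrt_rel_err() {
    float worst = 0.0f;
    for (float x = 0.25f; x <= 4.0f; x += 0.001f) {  // arbitrary sample range
        float approx = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
        float exact  = 1.0f / std::sqrt(x);
        float rel    = std::fabs(approx - exact) / exact;
        if (rel > worst) worst = rel;
    }
    return worst;
}
```

An error anywhere near 0.3 would therefore point at a bug in the shuffle sequence, not at rsqrtps itself.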