My SSE implementation of lookAt doesn't work

Question

So, I'm writing a math library using SSE intrinsics to use with my OpenGL application. Right now I'm implementing some of the more important functions like lookAt, using the glm library to check for correctness, but for some reason my implementation of lookAt isn't working as it should.

Here's the source code:

inline void lookAt(__m128 position, __m128 target, __m128 up)
{
    /* Get the target vector relative to the camera position */
    __m128 t = vec4::normalize3(_mm_sub_ps(target, position));
    __m128 u = vec4::normalize3(up);
    /* Get the right vector by crossing target and up. */
    __m128 r = vec4::normalize3(vec4::cross(t, u));
    /* Correct the up vector by crossing right and target. */
    u = vec4::cross(r, t);
    /* Negate the target vector. */
    t = _mm_sub_ps(_mm_setzero_ps(), t);

    /* Treat the right, up, and target vector as a matrix, and transpose it. */
    /* Conveniently, this also sets the w component of all four to 0.0f */
    _MM_TRANSPOSE4_PS(r, u, t, _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f));

    vec4 pos = _mm_sub_ps(_mm_setzero_ps(), position);
    pos.w = 1.0f;

    /* Multiply our matrix by the transposed vectors. */
    mat4 temp;
    temp.col0 = r;
    temp.col1 = u;
    temp.col2 = t;
    temp.col3 = _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f);

    multiply(temp);
    translate(pos);
}

My matrices are column-major, stored internally as "__m128 col0, col1, col2, col3;".

I made this after reading the man pages Here for gluLookAt. Once I realized that the right, up, and target vectors looked an awful lot like a row-major matrix, it was simple for me to transpose them so I could assign them to the rotation matrix.

The code for normalize3, in case it helps:

inline static __m128 normalize3(const __m128& vec)
{
    __m128 v = _mm_mul_ps(vec, vec);
    v = _mm_add_ps(
        _mm_add_ps(
            _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)),
            _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1))),
        _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)));

    return _mm_mul_ps(vec, _mm_rsqrt_ps(v));
}

It saves a couple of calls by ignoring the w component of the vector.

What am I doing wrong?

Here's some sample output. Using position(5.0, 5.0, 0.0), target(10.0, 20.0, 55.0), and up (0.0, 1.0, 0.0), I get:

From GLM:

[-0.9959] [ 0.0000] [ 0.0905] [ 4.9795]
[-0.0237] [ 0.9650] [-0.2610] [-4.7065]
[-0.0874] [-0.2621] [-0.9611] [ 1.7474]
[ 0.0000] [ 0.0000] [ 0.0000] [ 1.0000]

From my lookAt():

[-0.9959] [ 0.0000] [ 0.0905] [-5.0000]
[-0.0237] [ 0.9651] [-0.2610] [-5.0000]
[-0.0874] [-0.2621] [-0.9611] [ 0.0000]
[ 0.0000] [ 0.0000] [ 0.0000] [ 1.0000]

It seems that the only difference is in the third column, but I'm honestly not sure which of the two is correct. I'm inclined to say that GLM's is correct, since it was designed to be identical to the glu version.

EDIT: I just discovered something interesting. If I call "translate(pos);" before calling "multiply(temp);", my resulting matrix is exactly the same as glm's. Which is correct? According to the OpenGL man page on gluLookAt, this (and thus glm) is doing it backwards. Was I doing it right before, or it correct now?

Can you include the actual output matrix? That's the best starting place to debug something like this; use `gluLookAt (...)`'s output as a reference. — Andon M. Coleman, Feb 09 '14 at 23:59
Your `normalize3` does "normalize" the `w` values when I believe it should be ignoring them. — Ben Jackson, Feb 10 '14 at 00:40
That is true, and I do need to fix it, but in this case it doesn't matter because the _MM_TRANSPOSE4_PS replaces the w components with 0s anyway. — Haydn V. Harach, Feb 10 '14 at 00:49
After trying some different outputs, my version is 0.000243 off from what it should be. Is this just a case of rounding errors? — Haydn V. Harach, Feb 10 '14 at 02:26
The problem is likely in the `_mm_rsqrt_ps(v)` intrinsic. It's not accurate enough. — Z boson, Feb 10 '14 at 09:51
I just discovered something interesting. If I translate the matrix first, *then* multiply it by the rotation matrix, my result is identical to the output of glm's lookAt(). Which is correct? — Haydn V. Harach, Feb 11 '14 at 02:42

score 3 · Answer 1 · edited May 23 '17 at 12:13

One problem could be with _mm_rsqrt_ps(v). It's not very accurate. Replace it with _mm_div_ps(_mm_set1_ps(1.0f),_mm_sqrt_ps(v)). If that fixes the problem then you might be able to speed it up with some kind of root polishing Newton Raphson with SSE2 - can someone explain me these 3 lines

Another suggestion, you can make your function more SIMD friendly by not doing horizontal operations (which you do in your normalization function). Instead of normalizing the vectors before you transpose you can transpose first. This takes the vectors from (x,y,z,w) to (x,x,x,x), (y,y,y,y), (z,z,z,z), (w,w,w,w) - an Array of Structs (AoS) to a Struct of Arrays (SoA). Then you only need to do 1.0f/sqrt(rr+uu+t*t) to normalize.

__m128 t = _mm_sub_ps(target, position));
__m128 u = up;
__m128 r = vec4::cross(t, u);
u = vec4::cross(r, t);
t = _mm_sub_ps(_mm_setzero_ps(), t);
_MM_TRANSPOSE4_PS(r, u, t, _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f));  //AoS to SoA

//now normalize
__m128 den = _mm_add_ps(_mm_add_ps(_mm_mul_ps(r,r),_mm_mul_ps(u,u)), _mm_mul_ps(t,t));
__m128 norm = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(den));
r= _mm_mul_ps(norm,r); u =_mm_mul_ps(norm,u); t = _mm_mul_ps(norm,t);

norm is not a single scalar. It contains the four different normalizations (n1,n2,n3,n4) so norm*r = (n1*x1, n2*x2, n3*x3, n4*x4). See this link for an efficient way to do matrix multiplication with SSE

Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?

The problem with that code is that when you're normalizing "r", for instance, you're not actually normalizing the "r" vector - because it's been transposed, "r" is now actually "r.x, u.x, t.x, 0"). — Haydn V. Harach, Feb 10 '14 at 22:34
After replacing _mm_rsqrt_ps(v) with _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(v)), I'm no longer getting rounding errors. Thanks for that tip! Unfortunately, the the 4th column is still different from what glm gives me, even when I use multiply(temp);translate(-position);, which is exactly what the man pages say. — Haydn V. Harach, Feb 10 '14 at 22:42
@HaydnV.Harach, glad I could help. Yeah, that was silly what I said about the norm after the transpose. I removed it from my answer. — Z boson, Feb 10 '14 at 22:45
@HaydnV.Harach, actually, what I said about doing the normalization after the transpose was correct. It's true that r is no a 3D vector but that's okay. Norm in the code contains four different scalars (n1,n2,n3,n4) so norm*r = (n1*r.x, n2*u.x, n3*u.y, n4*0). It's more efficient to do it this way. — Z boson, Feb 11 '14 at 09:29

score 0 · Answer 2 · answered Feb 11 '14 at 05:41

0

I figured out the problem. My multiply function was multiplying matrices in the wrong order.

answered Feb 11 '14 at 05:41

Haydn V. Harach

1,185
14
30

My SSE implementation of lookAt doesn't work

2 Answers2

Linked