C Intrinsics Efficiency - Which is better?

Question

I am currently optimising a program where I need to calculate the reciprocal square root of a number of type __m128. Originally, before vectorising (and when the number was a float), it was just ans = 1.0f / sqrt(num), but now I have _mm_rsqrt_ps(num). The only problem is that this kicks my answer out by some amount when dealing with larger data sets.

I am left wondering if use of the _mm_div_ps() and _mm_sqrt_ps functions will be more accurate (though I expect take more time) and, at a side note, how to assign 1.0f to type __m128.

Thanks.

__m128 is a 4-element vector for single precision float type, you can not assign a single float to it. — user3528438, Feb 28 '16 at 17:59
and sometimes sqrt is calculated by multiplying the input with rsqrt because it's much easier this way, I suspect doing 1/sqrt helps with precision at all — user3528438, Feb 28 '16 at 18:02
So how for example would I get some `__m128` to contain the value of `1.0` such that I could use `_mm_div_ps(1_v, sqrt_v)`? — sjwarner, Feb 28 '16 at 18:03
Or you can try this http://stackoverflow.com/questions/14752399/newton-raphson-with-sse2-can-someone-explain-me-these-3-lines — user3528438, Feb 28 '16 at 18:08
It is often the case that 1/SQRT(X) can be computed faster than SQRT(X), in which case the former is a big performance win in compute-bound code. It should improve both both instruction throughput and latency. Of course you should conduct your own experiments to be sure. — Jeff Hammond, Feb 28 '16 at 19:56

score 4 · Accepted Answer · edited Jun 20 '20 at 09:12

I am left wondering if use of the _mm_div_ps() and _mm_sqrt_ps functions will be more accurate

Absolutely, because rsqrtps is not a precise operation, the whole point of it is that it's an approximation. As you can read in the manual of in the intrinsics guide,

The relative error for this approximation is:

|Relative Error| ≤ 1.5 ∗ 2⁻¹²

You may be tempted to read that as "approximately the first half of the bits in the significand are correct", but it's more annoying than that, it likes to give inexact result in cases that seem trivial. For example if you put in 4 you may get 0.499878 (actual result on my computer right now).

That does not necessarily mean you need a full square root and division. Maybe you do, but often it's good enough to use rsqrtps in combination with a refinement step (not tested):

__m128 y = _mm_rsqrt_ps(num);
__m128 yy = _mm_mul_ps(y, y);
__m128 hnum = _mm_mul_ps(num, _mm_set1_ps(0.5f));
__m128 threehalves = _mm_set1_ps(1.5f);
__m128 res = _mm_mul_ps(y, _mm_sub_ps(threehalves, _mm_mul_ps(yy, hnum)));

This this accurate to about twice as many bits as it was before. The above trick is not necessarily much of a win anymore (depending on how the code is used), on Core2 45nm where division and especially square root were horribly slow it wins hands down, but from IB and newer it's almost a tie in latency. Using sqrt and div still loses in throughput even on Skylake.

The above code also shows how to get constants in vectors.

C Intrinsics Efficiency - Which is better?

1 Answers1