I am left wondering if use of the _mm_div_ps() and _mm_sqrt_ps functions will be more accurate
Absolutely, because rsqrtps
is not a precise operation, the whole point of it is that it's an approximation. As you can read in the manual of in the intrinsics guide,
The relative error for this approximation is:
|Relative Error| ≤ 1.5 ∗ 2−12
You may be tempted to read that as "approximately the first half of the bits in the significand are correct", but it's more annoying than that, it likes to give inexact result in cases that seem trivial. For example if you put in 4 you may get 0.499878 (actual result on my computer right now).
That does not necessarily mean you need a full square root and division. Maybe you do, but often it's good enough to use rsqrtps
in combination with a refinement step (not tested):
__m128 y = _mm_rsqrt_ps(num);
__m128 yy = _mm_mul_ps(y, y);
__m128 hnum = _mm_mul_ps(num, _mm_set1_ps(0.5f));
__m128 threehalves = _mm_set1_ps(1.5f);
__m128 res = _mm_mul_ps(y, _mm_sub_ps(threehalves, _mm_mul_ps(yy, hnum)));
This this accurate to about twice as many bits as it was before. The above trick is not necessarily much of a win anymore (depending on how the code is used), on Core2 45nm where division and especially square root were horribly slow it wins hands down, but from IB and newer it's almost a tie in latency. Using sqrt and div still loses in throughput even on Skylake.
The above code also shows how to get constants in vectors.