fast square root optimization?

Question

If you check this very nice page:

http://www.codeproject.com/Articles/69941/Best-Square-Root-Method-Algorithm-Function-Precisi

You'll see this program:

#define SQRT_MAGIC_F 0x5f3759df 
 float  sqrt2(const float x)
{
  const float xhalf = 0.5f*x;

  union // get bits for floating value
  {
    float x;
    int i;
  } u;
  u.x = x;
  u.i = SQRT_MAGIC_F - (u.i >> 1);  // gives initial guess y0
  return x*u.x*(1.5f - xhalf*u.x*u.x);// Newton step, repeating increases accuracy 
}

My question is: Is there any particular reason why this isn't implemented as:

#define SQRT_MAGIC_F 0x5f3759df 
 float  sqrt2(const float x)
{

  union // get bits for floating value
  {
    float x;
    int i;
  } u;
  u.x = x;
  u.i = SQRT_MAGIC_F - (u.i >> 1);  // gives initial guess y0

  const float xux = x*u.x;

  return xux*(1.5f - .5f*xux*u.x);// Newton step, repeating increases accuracy 
}

As, from disassembly, I see one MUL less. Is there any purpose to having xhalf appear at all?

If your compiler is generating one less multiply in the second case then I suspect that either (a) you haven't enabled optimisations or (b) your compiler sucks. ;-) — Paul R, Oct 23 '13 at 12:57
Maybe the author is not on his best, run some bench, if the only difference is one `MUL` the time should be a little bit less high with your code than with his. — Thomas Ayoub, Oct 23 '13 at 12:57
@PaulR Why `xhalf` at all? It appears only once, why would `xhalf` matter? — user1095108, Oct 23 '13 at 12:59
Did you enable compiler optimisations, and did you benchmark both versions ? — Paul R, Oct 23 '13 at 13:00
IMHO, without inlining, this won't be faster than the builtin sqrt(). — wildplasser, Oct 23 '13 at 13:00
Indeed, this is a very old hack which worked well on old x86 CPUs in the time of Quake *et al*, but now is only really useful on CPUs that lack a fast sqrt (or sqrt estimate) instruction, e.g. embedded microcontrollers. — Paul R, Oct 23 '13 at 13:02
@PaulR Indeed, you were right `-O3` makes the compiled code the same with `xhalf`, `xux` or without both of them. I am going to change the question. — user1095108, Oct 23 '13 at 13:05
Great - first time I've been right about something today. ;-) — Paul R, Oct 23 '13 at 13:09
My guess is that he was experimenting along the line of the comments: repeating the Newton step, and this is just his one step version. With multiple steps it would make some kind of sense. But I agree the optimizer will find opportunities to get the same result either way. — Gene, Oct 23 '13 at 13:11
That's because floating point math is not associative, look at the answer [here](http://stackoverflow.com/a/6430525/995714) — phuclv, Oct 23 '13 at 13:12
@LưuVĩnhPhúc You mean, allow the compiler to preserve accuracy by subtracting the exponent by one? — user1095108, Oct 23 '13 at 13:15
Note that if you are going to use a union at all (which Paul R has already pointed out may not be such a good idea when targeting modern architectures) and are paying the cost of moving from floating-point register to integer register anyway, you can compute `xhalf` with an integer subtraction. You want to subtract one from the exponent, that is, subtract 1<<23 from another copy of `u.x`. Note: care may need to be taken with subnormal and infinite arguments). — Pascal Cuoq, Oct 23 '13 at 13:17
@LưuVĩnhPhúc `xhalf * u.x * u.x` gives the same result as `0.5 * x * u.x * u.x` for several reasons, each of which would be enough on its own: because the order of operations are the same in both expressions, and because multiplication by 0.5 is generally exact. — Pascal Cuoq, Oct 23 '13 at 13:19
I don't understand what you mean. `xhalf*(u.x*u.x)` may cause a different result than `(xhalf*u.x)*u.x` so may be it's faster, the compiler won't optimize it unless you specify fast math — phuclv, Oct 23 '13 at 13:20
@LưuVĩnhPhúc There is no `xhalf*(u.x*u.x)` in the question. There is only `xhalf*u.x*u.x`, which is equivalent to `(xhalf*u.x)*u.x` and to `(((0.5*x)*u.x)*u.x)` per C rules. — Pascal Cuoq, Oct 23 '13 at 13:22
@PaulR BTW, with a fast `sqrt` approximation, the question is still relevant on modern architectures: http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x — user1095108, Oct 23 '13 at 13:34
ahh, I just looked over the last line and made a mistake. But I still think maybe because that's more accurate — phuclv, Oct 23 '13 at 13:35
@PascalCuoq: The first code sequence has the subexpression `xhalf*u.x` where the second has `.5f*xux`. Expanding `xhalf` and `xux` in these gives `(.5f*x)*u.x` and `.5f*(x*u.x)`. If we do not expect the compiler to know anything about the value of `u.x`, it cannot determine these are equivalent. If `x` were `FLT_MAX` and `u.x` were two, then `(.5*x)*u.x` would be `FLT_MAX` and `.5f*(x*u.x)` would be infinity. — Eric Postpischil, Oct 23 '13 at 14:09
Relevant assembly http://tinyurl.com/19542275-fast-square-root-opti . Do note that if you use GCC or ICC you actually get different assembly for each function. — kvanbere, Oct 29 '13 at 10:21
http://en.wikipedia.org/wiki/Fast_inverse_square_root or you can get a fairly close approximation for integers by calculating the most significant bit and bitshifting >> half of the bits away like `num >> (MSB(num)>>1)` (see hacker's delight for the MSB parts) — technosaurus, Nov 03 '13 at 00:32

score 4 · Answer 1 · answered Nov 23 '13 at 18:50

It could be that legacy floating point math, which used 80 bit registers, was more accurate when the multipliers where linked together in the last line as intermediate results where kept in 80 bit registers.

The first multiplication in the upper implementation takes place in parallel to the integer math that follows, they use different execution resources. The second function on the other hand looks faster but it's hard to tell if it really is because of the above. Also, the const float xux = x*u.x; statement reduces the result back to 32 bit float, which may reduce overall accuracy.

You could test these functions head to head and compare them to the sqrt function in math.h (use double not float). This way you can see which is faster and which is more accurate.

fast square root optimization?

1 Answers1

Linked