TL;DR Yes.
Depending on the capabilities and deficiencies of your platform's hardware, tool chain, and math library, this may not be the fastest or least painful way of computing a double-precision square root. Below I show a straightforward approach based on Arnold Schönhage's coupled iteration for the square root and the reciprocal square root:
Starting with an approximation r_approx ≈ 1/√a to the reciprocal square root, we compute s_0 = a * r_approx and r_0 = r_approx / 2, then iterate:

s_(i+1) = s_i + r_i * (a - s_i * s_i)
r_(i+1) = r_i + r_i * (1 - 2 * r_i * s_(i+1))

where the s_i are approximations to √a and the r_i are approximations to 1/(2√a). This iteration is the Newton-Raphson iteration cleverly rearranged, and as such has quadratic convergence, meaning each step approximately doubles the number of correct bits. Starting with a single-precision r_approx, only two steps are needed to achieve double-precision accuracy.
If we now take advantage of the fused multiply-add operation (FMA), supported by common modern processors and usually accessible via the function fma(), each half-step requires only two FMAs. As an added bonus, we do not need special rounding logic, because FMA computes a*b+c using the full product a*b, without any intermediate truncation or rounding. The resulting code, given here in an ISO C99 version, is short and sweet:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fenv.h>
#include <math.h>
double my_sqrt (double a)
{
    double b, r, v, w;
    float bb, rr, ss;
    int e, t, f;

    if ((a <= 0) || isinf (a) || isnan (a)) {
        if (a < 0) {
            r = 0.0 / 0.0;
        } else {
            r = a + a;
        }
    } else {
        /* compute exponent adjustments */
        b = frexp (a, &e);
        t = e - 2*512;
        f = t / 2;
        t = t - 2 * f;
        f = f + 512;
        /* map argument into the primary approximation interval [0.25,1) */
        b = ldexp (b, t);
        bb = (float)b;
        /* compute reciprocal square root */
        ss = 1.0f / bb;
        rr = sqrtf (ss);
        r = (double)rr;
        /* Use A. Schoenhage's coupled iteration for the square root */
        v = 0.5 * r;
        w = b * r;
        w = fma (fma (w, -w, b), v, w);
        v = fma (fma (r, -w, 1), v, v);
        w = fma (fma (w, -w, b), v, w);
        /* map back from primary approximation interval by jamming exponent */
        r = ldexp (w, f);
    }
    return r;
}
/* Professor George Marsaglia's 64-bit KISS PRNG */
static uint64_t xx = 1234567890987654321ULL;
static uint64_t cc = 123456123456123456ULL;
static uint64_t yy = 362436362436362436ULL;
static uint64_t zz = 1066149217761810ULL;
static uint64_t tt;
#define MWC64 (tt = (xx << 58) + cc, cc = (xx >> 6), xx += tt, cc += (xx < tt), xx)
#define XSH64 (yy ^= (yy << 13), yy ^= (yy >> 17), yy ^= (yy << 43))
#define CNG64 (zz = 6906969069ULL * zz + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
    volatile union {
        double f;
        unsigned long long int i;
    } arg, res, ref;
    unsigned long long int count = 0ULL;

    do {
        arg.i = KISS64;
        ref.f = sqrt (arg.f);
        res.f = my_sqrt (arg.f);
        if (res.i != ref.i) {
            printf ("\n!!!! arg=% 23.16e %016llx res=% 23.16e %016llx ref=% 23.16e %016llx\n",
                    arg.f, arg.i, res.f, res.i, ref.f, ref.i);
        }
        count++;
        if ((count & 0xffffff) == 0) printf ("\rtests = %llu", count);
    } while (1);
    return EXIT_SUCCESS;
}
An exhaustive test of this code across two consecutive binades would take a small cluster of machines about a week; here I have included a quick "smoke" test using random operands instead.
On hardware that does not support the FMA operation, fma() will fall back to emulation. This is slow, and several such emulations have proven to be faulty. The Schönhage iteration itself works just fine without FMA, but additional rounding logic must then be added. Where truncating (round-to-zero) floating-point multiplies are supported, the easiest solution is Tuckerman rounding. Otherwise, it will likely be necessary to re-interpret the double-precision argument and preliminary result as 64-bit integers and perform the final rounding with the help of integer operations.