
To expand on the idea, let's say I have two 32-bit registers representing the high and low halves of a 64-bit floating-point number, and I want to calculate its 64-bit square root. However, while I don't have a 64-bit square root function, I do have a 32-bit square root function.

My question is this: does having a 32-bit square root at my disposal help me if I want to calculate that 64-bit square root, or am I sort of stuck trying to do a converging loop like Newton-Raphson or something like that?

MNagy
  • On this platform, do you also have a single-precision reciprocal square root available to you? Or at least a single-precision reciprocal? What about fused multiply-add (FMA)? Is that available? – njuffa Nov 02 '18 at 21:41
  • Could you clarify? From a comment on an answer, you said you have single-precision floating-point operations available. Do you want to compute the square root of a 64-bit (signed) integer, or to simulate the square root of a 64-bit double-precision floating-point value? – Lutz Lehmann Nov 07 '18 at 08:09

3 Answers


TL;DR Yes.

Depending on the capabilities and deficiencies of the hardware, tool chain, and math library of your platform, this may not necessarily allow for the fastest or least painful way of computing a double-precision square root. Below I am showing a straightforward approach based on Arnold Schönhage's coupled iteration for the square root and reciprocal square root:

Starting with an approximation to the reciprocal square root, r_approx ≈ 1/√a, we compute s_0 = a * r_approx and r_0 = r_approx / 2, then iterate:

s_{i+1} = s_i + r_i * (a - s_i * s_i)
r_{i+1} = r_i + r_i * (1 - 2 * r_i * s_{i+1})

where the s_i are approximations to √a, and the r_i are approximations to 1/(2√a). This iteration is the Newton-Raphson iteration cleverly re-arranged, and as such has quadratic convergence, meaning each step approximately doubles the number of correct bits. Starting with a single-precision r_approx, only two steps are needed to achieve double-precision accuracy.

If we now take advantage of the fused multiply-add operation (FMA), supported by common modern processors and usually accessible via the standard math function fma(), each half-step requires only two FMAs. As an added bonus, we do not need any special rounding logic, because fma() computes a*b+c using the full, unrounded product a*b and rounds only the final result. The resulting code, given here in an ISO C99 version, is short and sweet:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fenv.h>
#include <math.h>

double my_sqrt (double a)
{
    double b, r, v, w;
    float bb, rr, ss;
    int e, t, f;

    if ((a <= 0) || isinf (a) || isnan (a)) {
        if (a < 0) {
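            /* negative argument: return a quiet NaN */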
            r = 0.0 / 0.0;
        } else {
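            /* a is +/-0, infinity, or NaN: a + a returns a, quieting signaling NaNs */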
            r = a + a;
        }
    } else {
        /* compute exponent adjustments */
        b = frexp (a, &e);
        t = e - 2*512;
        f = t / 2;
        t = t - 2 * f;
        f = f + 512;
        /* map argument into the primary approximation interval [0.25,1) */
        b = ldexp (b, t);
        bb = (float)b;
        /* compute reciprocal square root */
        ss = 1.0f / bb;
        rr = sqrtf (ss);
        r = (double)rr;
        /* Use A. Schoenhage's coupled iteration for the square root */
        v = 0.5 * r;
        w = b * r;
        w = fma (fma (w, -w, b), v, w);
        v = fma (fma (r, -w, 1), v, v);
        w = fma (fma (w, -w, b), v, w);
        /* map back from primary approximation interval by jamming exponent */
        r = ldexp (w, f);
    }
    return r;
}

/* Professor George Marsaglia's 64-bit KISS PRNG */
static uint64_t xx = 1234567890987654321ULL;
static uint64_t cc = 123456123456123456ULL;
static uint64_t yy = 362436362436362436ULL;
static uint64_t zz = 1066149217761810ULL;
static uint64_t tt;
#define MWC64  (tt = (xx << 58) + cc, cc = (xx >> 6), xx += tt, cc += (xx < tt), xx)
#define XSH64  (yy ^= (yy << 13), yy ^= (yy >> 17), yy ^= (yy << 43))
#define CNG64  (zz = 6906969069ULL * zz + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

int main (void)
{
    volatile union {
        double f;
        unsigned long long int i;
    } arg, res, ref;
    unsigned long long int count = 0ULL;

    do {
        arg.i = KISS64;
        ref.f = sqrt (arg.f);
        res.f = my_sqrt (arg.f);
        if (res.i != ref.i) {
            printf ("\n!!!! arg=% 23.16e %016llx  res=% 23.16e %016llx  ref=% 23.16e %016llx\n",
                    arg.f, arg.i, res.f, res.i, ref.f, ref.i);
        }
        count++;
        if ((count & 0xffffff) == 0) printf ("\rtests = %llu", count);
    } while (1);
    return EXIT_SUCCESS;
}

An exhaustive test of this code across two consecutive binades would keep a small cluster of machines busy for about a week, so here I have included a quick "smoke" test using random operands instead.

On hardware that does not support the FMA operation, fma() will fall back to software emulation, which is slow, and several such emulations have proven to be faulty. The Schönhage iteration works just fine without FMA, but additional rounding logic then has to be added. Where truncating (round-to-zero) floating-point multiplies are available, the easiest solution is Tuckerman rounding. Otherwise it will likely be necessary to re-interpret the double-precision argument and the preliminary result as 64-bit integers and perform the rounding with the help of integer operations, as in the sketch below.
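To make that last option concrete, here is a minimal sketch of such an integer-based fix-up. It assumes the GCC/Clang unsigned __int128 extension, a positive normal argument, and a preliminary result already within one ulp of the true square root; the helper names cmp_a_midsq and fixup_last_bit are mine, binade boundaries and subnormals are glossed over, and the significands are extracted with frexp/ldexp rather than by bit reinterpretation, though the comparison itself is exact integer arithmetic.

#include <math.h>
#include <stdint.h>

/* Compare a exactly with m*m, where m is the midpoint with integer
   significand im at scale 2^(ey-54); returns <0, 0, or >0. Requires
   the GCC/Clang unsigned __int128 extension. */
static int cmp_a_midsq (double a, uint64_t im, int ey)
{
    int ea;
    double ma = frexp (a, &ea);                   /* a = ma * 2^ea */
    uint64_t ia = (uint64_t)ldexp (ma, 53);       /* a = ia * 2^(ea-53) */
    unsigned __int128 msq = (unsigned __int128)im * im; /* = m*m * 2^(108-2*ey) */
    unsigned __int128 aa = ia;
    int d = (ea - 53) - (2 * ey - 108);           /* 54 or 55 when y ~ sqrt(a) */
    if (d >= 0) aa <<= d; else msq <<= -d;        /* align binary points */
    return (aa > msq) - (aa < msq);
}

/* Nudge a preliminary result y, already within one ulp of sqrt(a), to
   the correctly rounded (nearest) result; a must be positive and
   normal, and binade boundaries are ignored in this sketch. Exact
   ties cannot occur: the square of an odd 54-bit significand needs
   107 bits, but a has only 53. */
static double fixup_last_bit (double y, double a)
{
    int ey;
    double my = frexp (y, &ey);
    uint64_t iy = (uint64_t)ldexp (my, 53);       /* y = iy * 2^(ey-53) */
    if (cmp_a_midsq (a, 2 * iy + 1, ey) > 0)      /* a above midpoint (y, y+ulp)? */
        return nextafter (y, INFINITY);
    if (cmp_a_midsq (a, 2 * iy - 1, ey) < 0)      /* a below midpoint (y-ulp, y)? */
        return nextafter (y, 0.0);
    return y;
}

With the FMA-based code above no such fix-up is needed; it only becomes relevant on the emulated, FMA-free path.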

njuffa

Probably not.

Assuming we have the two parts of a 64-bit integer variable i64 as hi and lo, then

sqrt(i64) = sqrt(hi*2^32 + lo)

We don't have a way to simplify the square root of a sum into an expression involving the square roots of its parts (in general √(x + y) ≠ √x + √y), so we can't calculate a 64-bit square root from a 32-bit square root.

But you said that you have a 64-bit floating-point value. Are you on a platform without an FPU? Is your 32-bit square root a floating-point or an integer function? The same problem occurs either way, since the mantissa doesn't fit in a single register, but you can get some approximations if full precision is not needed, as in the sketch below.
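For instance, if what is actually needed is the integer square root of a 64-bit value, a minimal sketch in C (my illustration, not GPU code) could seed a couple of integer Newton steps with the 32-bit float square root and then clamp to the exact floor:

#include <math.h>
#include <stdint.h>

/* Sketch: floor(sqrt(x)) for a 64-bit integer, seeded by the 32-bit
   float square root. The float conversion keeps only ~24 significant
   bits, so the seed is approximate; two integer Newton steps
   (y <- (y + x/y) / 2) and a final clamp make it exact. */
uint64_t isqrt64 (uint64_t x)
{
    if (x == 0) return 0;
    uint64_t y = (uint64_t)sqrtf ((float)x);      /* ~24-bit-accurate seed */
    y = (y + x / y) / 2;                          /* ~48 good bits */
    y = (y + x / y) / 2;                          /* off by at most a few */
    if (y > 0xffffffffu) y = 0xffffffffu;         /* keep y*y from overflowing */
    while (y * y > x) y--;                        /* clamp down ... */
    while (y < 0xffffffffu && (y + 1) * (y + 1) <= x) y++; /* ... or up */
    return y;
}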

phuclv
  • I'm on a GPU, and I'm working in DirectX assembly. The 32-bit square root is a floating-point function. Unfortunately, an approximation is useless for my purposes, as this is all part of a cryptographic hash function; a single bit out of place will invalidate the output. – MNagy Oct 30 '18 at 05:01
  • @MNagy Hmm, I bet the values you use are integers, which makes even your 32-bit float a 23-bit value or worse on a GPU. What about computing the value yourself completely? See this: [How to get a square root for 32 bit input in one clock cycle only?](https://stackoverflow.com/a/34657972/2521214) It can easily be modified for 64 bit, and it does not use multiplication or any other slow operation. It's not approximating either: you just double the LUT size and the iteration count, and even the LUT can be gotten rid of by shifting a value. – Spektre Oct 30 '18 at 09:07

You still have to program Newton-Raphson, but you can save a lot of iterations by using the 32-bit square root to compute a 32-bit approximation and taking that as the starting value, so that Newton-Raphson converges on the exactly correct answer in far fewer iterations, as sketched below. This is a worthwhile saving: hardware square root sometimes uses a table lookup to find starting points for Newton-Raphson, and the best theoretical complexity analyses assume that you use lower precision for the earlier iterations to save time.
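Here is a minimal sketch of this seeding idea, assuming the argument is positive, finite, and within float range; the function name is illustrative, and the last bit is not guaranteed to be correctly rounded without a final fix-up:

#include <math.h>

/* Sketch: double-precision square root seeded by the 32-bit sqrt.
   Assumes a is positive, finite, and within float range; the final
   bit is not guaranteed to be correctly rounded without extra care. */
double sqrt_seeded (double a)
{
    double x = (double)sqrtf ((float)a);   /* seed: ~24 correct bits */
    /* each Newton step x <- (x + a/x)/2 roughly doubles the number of
       correct bits: 24 -> 48 -> 96, so two steps cover the 53-bit
       significand */
    x = 0.5 * (x + a / x);
    x = 0.5 * (x + a / x);
    return x;
}

Starting from a crude guess such as x = a instead can take dozens of iterations for arguments with large or small exponents, which is exactly the work the 32-bit seed saves.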

mcdowella