Looking for an efficient integer square root algorithm for ARM Thumb2

Question

I am looking for a fast, integer only algorithm to find the square root (integer part thereof) of an unsigned integer. The code must have excellent performance on ARM Thumb 2 processors. It could be assembly language or C code.

Any hints welcome.

Craig McQueen · Answer 1 · 2016-09-15T23:05:35.637

Integer Square Roots by Jack W. Crenshaw could be useful as another reference.

The C Snippets Archive also has an integer square root implementation. This one goes beyond just the integer result, and calculates extra fractional (fixed-point) bits of the answer. (Update: unfortunately, the C snippets archive is now defunct. The link points to the web archive of the page.) Here is the code from the C Snippets Archive:

#define BITSPERLONG 32
#define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2))

struct int_sqrt {
    unsigned sqrt, frac;
};

/* usqrt:
    ENTRY x: unsigned long
    EXIT  returns floor(sqrt(x) * pow(2, BITSPERLONG/2))

    Since the square root never uses more than half the bits
    of the input, we use the other half of the bits to contain
    extra bits of precision after the binary point.

    EXAMPLE
        suppose BITSPERLONG = 32
        then    usqrt(144) = 786432 = 12 * 65536
                usqrt(32) = 370727 = 5.66 * 65536

    NOTES
        (1) change BITSPERLONG to BITSPERLONG/2 if you do not want
            the answer scaled.  Indeed, if you want n bits of
            precision after the binary point, use BITSPERLONG/2+n.
            The code assumes that BITSPERLONG is even.
        (2) This is really better off being written in assembly.
            The line marked below is really a "arithmetic shift left"
            on the double-long value with r in the upper half
            and x in the lower half.  This operation is typically
            expressible in only one or two assembly instructions.
        (3) Unrolling this loop is probably not a bad idea.

    ALGORITHM
        The calculations are the base-two analogue of the square
        root algorithm we all learned in grammar school.  Since we're
        in base 2, there is only one nontrivial trial multiplier.

        Notice that absolutely no multiplications or divisions are performed.
        This means it'll be fast on a wide range of processors.
*/

void usqrt(unsigned long x, struct int_sqrt *q)
{
      unsigned long a = 0L;                   /* accumulator      */
      unsigned long r = 0L;                   /* remainder        */
      unsigned long e = 0L;                   /* trial product    */

      int i;

      for (i = 0; i < BITSPERLONG; i++)   /* NOTE 1 */
      {
            r = (r << 2) + TOP2BITS(x); x <<= 2; /* NOTE 2 */
            a <<= 1;
            e = (a << 1) + 1;
            if (r >= e)
            {
                  r -= e;
                  a++;
            }
      }
      memcpy(q, &a, sizeof(long));
}

I settled on the following code. It's essentially from the Wikipedia article on square-root computing methods. But it has been changed to use stdint.h types uint32_t etc. Strictly speaking, the return type could be changed to uint16_t.

/**
 * \brief    Fast Square root algorithm
 *
 * Fractional parts of the answer are discarded. That is:
 *      - SquareRoot(3) --> 1
 *      - SquareRoot(4) --> 2
 *      - SquareRoot(5) --> 2
 *      - SquareRoot(8) --> 2
 *      - SquareRoot(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRoot(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }
    return res;
}

The nice thing, I discovered, is that a fairly simple modification can return the "rounded" answer. I found this useful in a certain application for greater accuracy. Note that in this case, the return type must be uint32_t because the rounded square root of 2³² - 1 is 2¹⁶.

/**
 * \brief    Fast Square root algorithm, with rounding
 *
 * This does arithmetic rounding of the result. That is, if the real answer
 * would have a fractional part of 0.5 or greater, the result is rounded up to
 * the next integer.
 *      - SquareRootRounded(2) --> 1
 *      - SquareRootRounded(3) --> 2
 *      - SquareRootRounded(4) --> 2
 *      - SquareRootRounded(6) --> 2
 *      - SquareRootRounded(7) --> 3
 *      - SquareRootRounded(8) --> 3
 *      - SquareRootRounded(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRootRounded(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }

    /* Do arithmetic rounding to nearest integer */
    if (op > res)
    {
        res++;
    }

    return res;
}

Out of curiosity I benchmarked a 64-bit conversion of this against the static_casting the C library sqrt function to get an integer result, I found this to be 8.2x slower. YMMV. More data at http://onemanmmo.com/?sqrt — Robert Basler, Mar 29 '12 at 20:18
@RobertBasler: It's good that you've measured it. This sort of thing is very hardware-specific; in your case (on a processor with floating-point hardware) it was certainly worth doing a comparison. I expect these integer square-root algorithms would be more useful for embedded systems without floating-point hardware. — Craig McQueen, Mar 29 '12 at 22:39
IEEE double-precision floating point can exactly represent integers up to ~53 bits (the size of the mantissa), but beyond that the results are inexact. One advantage of integer sqrt is that it always gives exact answers. — Peter Cordes, Oct 21 '15 at 04:46
For Cortex M3 and brethren, the first loop can be substituted by a leading zero count and mask operation: one >>= clz(op) & ~0x3; Lops off a good ~30 cycles. — Spatial, Dec 30 '16 at 11:08

score 16 · Answer 2 · edited Aug 23 '19 at 13:20

If exact accuracy isn't required, I have a fast approximation for you, that uses 260 bytes of RAM (you could halve that, but don't).

int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340};
int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977};

int fisqrt(int val)
{
    int cnt=0;
    int t=val;
    while (t) {cnt++;t>>=1;}
    if (6>=cnt)    t=(val<<(6-cnt));
    else           t=(val>>(cnt-6));

    return (ftbl[cnt]*ftbl2[t&31])>>15;
}

Here's the code to generate the tables:

ftbl[0]=0;
for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i));
printf("int ftbl[33]={0");
for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]);
printf("};\n");

for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768;
printf("int ftbl2[32]={");
for (int i=0;i<32;i++) printf("%c%d",(i)?',':' ',ftbl2[i]);
printf("};\n");

Over the range 1 → 2²⁰, the maximum error is 11, and over the range 1 → 2³⁰, it's about 256. You could use larger tables and minimise this. It's worth mentioning that the error will always be negative - i.e. when it's wrong, the value will be LESS than the correct value.

You might do well to follow this with a refining stage.

The idea is simple enough: (ab)^0.5 = a^0.b × b^0.5.

So, we take the input X = A×B where A = 2^N and 1 ≤ B < 2

Then we have a lookup table for sqrt(2^N), and a lookup table for sqrt(1 ≤ B < 2). We store the lookup table for sqrt(2^N) as integer, which might be a mistake (testing shows no ill effects), and we store the lookup table for sqrt(1 ≤ B < 2) as 15-bit fixed-point.

We know that 1 ≤ sqrt(2^N) < 65536, so that's 16-bit, and we know that we can really only multiply 16-bit × 15-bit on an ARM, without fear of reprisal, so that's what we do.

In terms of implementation, the while(t) {cnt++;t>>=1;} is effectively a count-leading-bits instruction (CLB), so if your version of the chipset has that, you're winning! Also, the shift instruction would be easy to implement with a bidirectional shifter, if you have one?

There's a Lg[N] algorithm for counting the highest set bit here.

In terms of magic numbers, for changing table sizes, THE magic number for ftbl2 is 32, though note that 6 (Lg[32]+1) is used for the shifting.

FWIW, though I don't really recommend this, you can quarter your overall error, with some biasing, viz: int v1=fisqrt(val); v1+=fisqrt(val-v1*v1)/16; 16 is the power of two that works best, over the range 1->2^24. — Dave Gamble, Jul 08 '09 at 22:17

score 8 · Answer 3 · edited Aug 23 '19 at 13:12

8

I find that most algorithms are based on simple ideas, but are implemented in a way more complicated manner than necessary. I've taken the idea from here: http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf (by Ross M. Fosler) and made it into a very short C-function:

uint16_t int_sqrt32(uint32_t x)
{
    uint16_t res=0;
    uint16_t add= 0x8000;   
    int i;
    for(i=0;i<16;i++)
    {
        uint16_t temp=res | add;
        uint32_t g2=temp*temp;      
        if (x>=g2)
        {
            res=temp;           
        }
        add>>=1;
    }
    return res;
}

This compiles to 5 cycles/bit on my blackfin. I believe your compiled code will in general be faster if you use for loops instead of while loops, and you get the added benefit of deterministic time (although that to some extent depends on how your compiler optimizes the if statement.)

edited Aug 23 '19 at 13:12

phuclv

27,258
11
104
360

answered Apr 26 '12 at 09:45

Gutskalk

97
1
3

Sorry, that should be 5 cycles/bit of the output, which is half as many bits as the input. So 2.5 cycles/bit of the input. – Gutskalk Apr 26 '12 at 09:50
5

There's a small bug here. In the expression "temp*temp" you need to cast either of the operands to uint32_t to make sure the multiplication is done in 32-bit arithmetic not 16-bit. The code as-is doesn't work on AVR because of this (but it seems to on platforms where int is 32-bit, due to defaut promotion, but it may still cause integer overflow there). – Ambroz Bizjak Jun 01 '13 at 17:00
2

Another thing: "uint16_t add= 0x8000;" should be changed to "uint16_t add= UINT16_C(0x8000);". – Ambroz Bizjak Jun 01 '13 at 17:00
I didn't benchmark it, but yields correct results with the suggestions from @AmbrozBizjak, thank you both! – Violet Giraffe Jul 05 '19 at 13:03

score 8 · Answer 4 · answered Jul 08 '09 at 19:35

One common approach is bisection.

hi = number
lo = 0
mid = ( hi + lo ) / 2
mid2 = mid*mid
while( lo < hi-1 and mid2 != number ) {
    if( mid2 < number ) {
        lo = mid
    else
        hi = mid
    mid = ( hi + lo ) / 2
    mid2 = mid*mid

Something like that should work reasonably well. It makes log2(number) tests, doing log2(number) multiplies and divides. Since the divide is a divide by 2, you can replace it with a >>.

The terminating condition may not be spot on, so be sure to test a variety of integers to be sure that the division by 2 doesn't incorrectly oscillate between two even values; they would differ by more than 1.

Yazou · Answer 5 · 2013-07-28T11:16:29.520

7

It depends about the usage of the sqrt function. I often use some approx to make fast versions. For example, when I need to compute the module of vector :

Module = SQRT( x^2 + y^2)

I use :

Module = MAX( x,y) + Min(x,y)/2

Which can be coded in 3 or 4 instructions as:

If (x > y )
  Module  = x + y >> 1;
Else
   Module  = y + x >> 1;

edited Jul 28 '13 at 11:16

answered Jul 04 '12 at 13:37

Yazou

196
1
10

It should be noted that this is the alpha max plus beta min algorithm, using alpha = 1 and beta = 1/2. https://en.wikipedia.org/wiki/Alpha_max_plus_beta_min_algorithm – Chi Apr 04 '19 at 03:19

score 7 · Answer 6 · edited Aug 23 '19 at 13:12

7

It's not fast but it's small and simple:

int isqrt(int n)
{
  int b = 0;

  while(n >= 0)
  {
    n = n - b;
    b = b + 1;
    n = n - b;
  }

  return b - 1;
}

edited Aug 23 '19 at 13:12

phuclv

27,258
11
104
360

answered Aug 27 '09 at 20:54

Does this use integer overflow? – yyny Jan 31 '16 at 23:22

score 3 · Answer 7 · edited Sep 21 '13 at 22:16

3

I have settled to something similar to the binary digit-by-digit algorithm described in this Wikipedia article.

edited Sep 21 '13 at 22:16

SheetJS

20,261
12
60
74

answered Jul 08 '09 at 23:32

Ber

34,859
15
60
79

score 2 · Answer 8 · edited Aug 23 '19 at 13:11

I recently encountered the same task on ARM Cortex-M3 (STM32F103CBT6) and after searching the Internet came up with the following solution. It's not the fastest comparing with solutions offered here, but it has good accuracy (maximum error is 1, i.e. LSB on the entire UI32 input range) and relatively good speed (about 1.3M square roots per second on a 72-MHz ARM Cortex-M3 or about 55 cycles per single root including the function call).

// FastIntSqrt is based on Wikipedia article:
// https://en.wikipedia.org/wiki/Methods_of_computing_square_roots
// Which involves Newton's method which gives the following iterative formula:
//
// X(n+1) = (X(n) + S/X(n))/2
//
// Thanks to ARM CLZ instruction (which counts how many bits in a number are
// zeros starting from the most significant one) we can very successfully
// choose the starting value, so just three iterations are enough to achieve
// maximum possible error of 1. The algorithm uses division, but fortunately
// it is fast enough here, so square root computation takes only about 50-55
// cycles with maximum compiler optimization.
uint32_t FastIntSqrt (uint32_t value)
{
    if (!value)
        return 0;

    uint32_t xn = 1 << ((32 - __CLZ (value))/2);
    xn = (xn + value/xn)/2;
    xn = (xn + value/xn)/2;
    xn = (xn + value/xn)/2;
    return xn;
}

I'm using IAR and it produces the following assembler code:

        SECTION `.text`:CODE:NOROOT(1)
        THUMB
_Z11FastIntSqrtj:
        MOVS     R1,R0
        BNE.N    ??FastIntSqrt_0
        MOVS     R0,#+0
        BX       LR
??FastIntSqrt_0:
        CLZ      R0,R1
        RSB      R0,R0,#+32
        MOVS     R2,#+1
        LSRS     R0,R0,#+1
        LSL      R0,R2,R0
        UDIV     R3,R1,R0
        ADDS     R0,R3,R0
        LSRS     R0,R0,#+1
        UDIV     R2,R1,R0
        ADDS     R0,R2,R0
        LSRS     R0,R0,#+1
        UDIV     R1,R1,R0
        ADDS     R0,R1,R0
        LSRS     R0,R0,#+1
        BX       LR               ;; return

score 1 · Answer 9 · edited Aug 23 '19 at 13:12

1

Here is a solution in Java that combines integer log_2 and Newton's method to create a loop free algorithm. As a downside, it needs division. The commented lines are required to upconvert to a 64-bit algorithm.

private static final int debruijn= 0x07C4ACDD;
//private static final long debruijn= ( ~0x0218A392CD3D5DBFL)>>>6;

static
{
  for(int x= 0; x<32; ++x)
  {
    final long v= ~( -2L<<(x));
    DeBruijnArray[(int)((v*debruijn)>>>27)]= x; //>>>58
  }
  for(int x= 0; x<32; ++x)
    SQRT[x]= (int) (Math.sqrt((1L<<DeBruijnArray[x])*Math.sqrt(2)));
}

public static int sqrt(final int num)
{
  int y;
  if(num==0)
    return num;
  {
    int v= num;
    v|= v>>>1; // first round up to one less than a power of 2 
    v|= v>>>2;
    v|= v>>>4;
    v|= v>>>8;
    v|= v>>>16;
    //v|= v>>>32;
    y= SQRT[(v*debruijn)>>>27]; //>>>58
  }
  //y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  return y*y>num?y-1:y;
}

How this works: The first part produces a square root accurate to about three bits. The line y= (y+num/y)>>1; doubles the accuracy in bits. The last line eliminates the roof roots that can be generated.

edited Aug 23 '19 at 13:12

phuclv

27,258
11
104
360

answered Feb 07 '14 at 11:44

warren

523
2
11

I tried 3 other implementations on this page, this one is the fastest when I implemented in C#. Dave Gamble's implementation came in second at about 25% slower than this one. I believe most of the loop based one are just slow... – Dave Feb 09 '15 at 23:40
Yep, this is probably the fastest you can do on a CPU with division but without a FPU or extended bit manipulation instructions. It is worth noting that the 64 bit version of the algorithm may get better precision for large numbers than an IEEE 754 double on some computers. – warren Feb 13 '15 at 22:06
I haven't been able to make this work (assuming `SQRT` and `DeBruijnArray` are both `int[32]`, and adding a necessary cast to `int` to make it compile). It seems to write out of bounds during the first initialization loop. – pie3636 Mar 13 '17 at 22:00
The code is tested. The question is if I copied it correctly. One of those is an int[64] in the 64 bit version. – warren Mar 14 '17 at 23:44

Bumsik Kim · Answer 10 · 2019-08-23T14:02:12.580

If you need it just for ARM Thumb 2 processors, CMSIS DSP library by ARM is the best shot for you. It's made by people who designed Thumb 2 processors. Who else can beat it?

Actually you don't even need an algorithm but specialized square root hardware instructions such as VSQRT. The ARM company maintains math and DSP algorithm implementations highly optimized for Thumb 2 supported processors by trying to use its hardware like VSQRT. You can get the source code:

arm_sqrt_f32()
arm_sqrt_q15.c / arm_sqrt_q31.c (q15 and q31 are the fixed-point data types specialized for ARM DSP core, which is often comes with Thum 2 compatible processors.)

Note that ARM also maintains compiled binaries of CMSIS DSP that guarantees the best possible performance for ARM Thumb architecture-specific instructions. So you should consider statically link them when you use the library. You can get the binaries here.

Kim, thanks for this answer, it is certainly useful for many. However, I was looking for an integer square root. — Ber, Aug 27 '19 at 13:18

score 0 · Answer 11 · answered Sep 13 '11 at 18:34

This method is similar to long division: you construct a guess for the next digit of the root, do a subtraction, and enter the digit if the difference meets certain criteria. With a the binary version, your only choice for the next digit is 0 or 1, so you always guess 1, do the subtraction, and enter a 1 unless the difference is negative.

http://www.realitypixels.com/turk/opensource/index.html#FractSqrt

score -1 · Answer 12 · edited Aug 23 '19 at 13:10

I implemented Warren's suggestion and the Newton method in C# for 64-bit integers. Isqrt uses the Newton method, while Isqrt uses Warren's method. Here is the source code:

using System;

namespace Cluster
{
    public static class IntegerMath
    {


        /// <summary>
        /// Compute the integer square root, the largest whole number less than or equal to the true square root of N.
        /// 
        /// This uses the integer version of Newton's method.
        /// </summary>
        public static long Isqrt(this long n)
        {
            if (n < 0) throw new ArgumentOutOfRangeException("n", "Square root of negative numbers is not defined.");
            if (n <= 1) return n;

            var xPrev2 = -2L;
            var xPrev1 = -1L;
            var x = 2L;
            // From Wikipedia: if N + 1 is a perfect square, then the algorithm enters a two-value cycle, so we have to compare 
            // to two previous values to test for convergence.
            while (x != xPrev2 && x != xPrev1)
            {
                xPrev2 = xPrev1;
                xPrev1 = x;
                x = (x + n/x)/2;
            }
            // The two values x and xPrev1 will be above and below the true square root. Choose the lower one.
            return x < xPrev1 ? x : xPrev1;
        }

        #region Sqrt using Bit-shifting and magic numbers.

        // From http://stackoverflow.com/questions/1100090/looking-for-an-efficient-integer-square-root-algorithm-for-arm-thumb2
        // Converted to C#.
        private static readonly ulong debruijn= ( ~0x0218A392CD3D5DBFUL)>>6;
        private static readonly ulong[] SQRT = new ulong[64];
        private static readonly int[] DeBruijnArray = new int[64];

        static IntegerMath()
        {
          for(int x= 0; x<64; ++x)
          {
            ulong v= (ulong) ~( -2L<<(x));
            DeBruijnArray[(v*debruijn)>>58]= x;
          }
          for(int x= 0; x<64; ++x)
            SQRT[x]= (ulong) (Math.Sqrt((1L<<DeBruijnArray[x])*Math.Sqrt(2)));
        }

        public static long Isqrt2(this long n)
        {
          ulong num = (ulong) n; 
          ulong y;
          if(num==0)
            return (long)num;
          {
            ulong v= num;
            v|= v>>1; // first round up to one less than a power of 2 
            v|= v>>2;
            v|= v>>4;
            v|= v>>8;
            v|= v>>16;
            v|= v>>32;
            y= SQRT[(v*debruijn)>>58];
          }
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          // Make sure that our answer is rounded down, not up.
          return (long) (y*y>num?y-1:y);
        }

        #endregion

    }
}

I used the following to benchmark the code:

using System;
using System.Diagnostics;
using Cluster;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace ClusterTests
{
    [TestClass]
    public class IntegerMathTests
    {
        [TestMethod]
        public void Isqrt_Accuracy()
        {
            for (var n = 0L; n <= 100000L; n++)
            {
                var expectedRoot = (long) Math.Sqrt(n);
                var actualRoot = n.Isqrt();
                Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n));
            }
        }

        [TestMethod]
        public void Isqrt2_Accuracy()
        {
            for (var n = 0L; n <= 100000L; n++)
            {
                var expectedRoot = (long)Math.Sqrt(n);
                var actualRoot = n.Isqrt2();
                Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n));
            }
        }

        [TestMethod]
        public void Isqrt_Speed()
        {
            var integerTimer = new Stopwatch();
            var libraryTimer = new Stopwatch();

            integerTimer.Start();
            var total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = n.Isqrt();
                total += root;
            }
            integerTimer.Stop();

            libraryTimer.Start();
            total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = (long)Math.Sqrt(n);
                total += root;
            }
            libraryTimer.Stop();

            var isqrtMilliseconds = integerTimer.ElapsedMilliseconds;
            var libraryMilliseconds = libraryTimer.ElapsedMilliseconds;
            var msg = String.Format("Isqrt: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds);
            Debug.WriteLine(msg);
            Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg);
        }

        [TestMethod]
        public void Isqrt2_Speed()
        {
            var integerTimer = new Stopwatch();
            var libraryTimer = new Stopwatch();

            var warmup = (10L).Isqrt2();

            integerTimer.Start();
            var total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = n.Isqrt2();
                total += root;
            }
            integerTimer.Stop();

            libraryTimer.Start();
            total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = (long)Math.Sqrt(n);
                total += root;
            }
            libraryTimer.Stop();

            var isqrtMilliseconds = integerTimer.ElapsedMilliseconds;
            var libraryMilliseconds = libraryTimer.ElapsedMilliseconds;
            var msg = String.Format("isqrt2: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds);
            Debug.WriteLine(msg);
            Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg);
        }

    }
}

My results on a Dell Latitude E6540 in Release mode, Visual Studio 2012 were that the Library call Math.Sqrt is faster.

59 ms - Newton (Isqrt)
12 ms - Bit shifting (Isqrt2)
5 ms - Math.Sqrt

I am not clever with compiler directives, so it may be possible to tune the compiler to get the integer math faster. Clearly, the bit-shifting approach is very close to the library. On a system with no math coprocessor, it would be very fast.

score -2 · Answer 13 · edited Aug 23 '19 at 13:10

I've designed a 16-bit sqrt for RGB gamma compression. It dispatches into 3 different tables, based on the higher 8 bits. Disadvantages: it uses about a kilobyte for the tables, rounds unpredictable, if exact sqrt is impossible, and, in the worst case, uses single multiplication (but only for a very few input values).

static uint8_t sqrt_50_256[] = {
  114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,
  133,134,135,136,137,138,139,140,141,142,143,143,144,145,146,147,148,149,150,
  150,151,152,153,154,155,155,156,157,158,159,159,160,161,162,163,163,164,165,
  166,167,167,168,169,170,170,171,172,173,173,174,175,175,176,177,178,178,179,
  180,181,181,182,183,183,184,185,185,186,187,187,188,189,189,190,191,191,192,
  193,193,194,195,195,196,197,197,198,199,199,200,201,201,202,203,203,204,204,
  205,206,206,207,207,208,209,209,210,211,211,212,212,213,214,214,215,215,216,
  217,217,218,218,219,219,220,221,221,222,222,223,223,224,225,225,226,226,227,
  227,228,229,229,230,230,231,231,232,232,233,234,234,235,235,236,236,237,237,
  238,238,239,239,240,241,241,242,242,243,243,244,244,245,245,246,246,247,247,
  248,248,249,249,250,250,251,251,252,252,253,253,254,254,255,255
};

static uint8_t sqrt_0_10[] = {
  1,2,3,3,4,4,5,5,5,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,9,10,10,10,10,10,11,11,11,
  11,11,11,12,12,12,12,12,12,13,13,13,13,13,13,13,14,14,14,14,14,14,14,15,15,
  15,15,15,15,15,15,16,16,16,16,16,16,16,16,17,17,17,17,17,17,17,17,17,18,18,
  18,18,18,18,18,18,18,19,19,19,19,19,19,19,19,19,19,20,20,20,20,20,20,20,20,
  20,20,21,21,21,21,21,21,21,21,21,21,21,22,22,22,22,22,22,22,22,22,22,22,23,
  23,23,23,23,23,23,23,23,23,23,23,24,24,24,24,24,24,24,24,24,24,24,24,25,25,
  25,25,25,25,25,25,25,25,25,25,25,26,26,26,26,26,26,26,26,26,26,26,26,26,27,
  27,27,27,27,27,27,27,27,27,27,27,27,27,28,28,28,28,28,28,28,28,28,28,28,28,
  28,28,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,30,30,30,30,30,30,30,30,
  30,30,30,30,30,30,30,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,32,32,
  32,32,32,32,32,32,32,32,32,32,32,32,32,32,33,33,33,33,33,33,33,33,33,33,33,
  33,33,33,33,33,33,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,35,35,
  35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,36,36,36,36,36,36,36,36,36,
  36,36,36,36,36,36,36,36,36,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,
  37,37,37,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,39,39,39,
  39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,40,40,40,40,40,40,40,40,
  40,40,40,40,40,40,40,40,40,40,40,40,41,41,41,41,41,41,41,41,41,41,41,41,41,
  41,41,41,41,41,41,41,41,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,
  42,42,42,42,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,
  43,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,45,45,
  45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,46,46,46,46,
  46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,47,47,47,47,47,47,
  47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,48,48,48,48,48,48,48,
  48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,49,49,49,49,49,49,49,49,
  49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,50,50,50,50,50,50,50,50,
  50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,51,51,51,51,51,51,51,51,
  51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,52,52,52,52,52,52,52,
  52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,53,53
};

static uint8_t sqrt_11_49[] = {
  54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,0,76,77,78,
  0,79,80,81,82,83,0,84,85,86,0,87,88,89,0,90,0,91,92,93,0,94,0,95,96,97,0,98,0,
  99,100,101,0,102,0,103,0,104,105,106,0,107,0,108,0,109,0,110,0,111,112,113
};

uint16_t isqrt16(uint16_t v) {
  uint16_t a, b;
  uint16_t h = v>>8;
  if (h <= 10) return v ? sqrt_0_10[v>>2] : 0;
  if (h >= 50) return sqrt_50_256[h-50];
  h = (h-11)<<1;
  a = sqrt_11_49[h];
  b = sqrt_11_49[h+1];
  if (!a) return b;
  return b*b > v ? a : b;
}

I've compared it against the log2 based sqrt, using clang's __builtin_clz (which should expand to a single assembly opcode), and the library's sqrtf, called using (int)sqrtf((float)i). And got rather strange results:

$ gcc -O3 test.c -o test && ./test 
isqrt16: 6.498955
sqrtf: 6.981861
log2_sqrt: 61.755873

Clang compiled the call to sqrtf to a sqrtss instruction, which is nearly as fast as that table sqrt. Lesson learned: on x86 the compiler can provide fast enough sqrt, which is less than 10% slower than what you yourself can come up with, wasting a lot of time, or can be 10 times faster, if you use some ugly bitwise hacks. And still sqrtss is a bit slower than custom function, so if you really need these 5%, you can get them, and ARM for example has no sqrtss, so log2_sqrt shouldn't lag that bad.

On x86, where FPU is available, the old Quake hack appears to be the fastest way to calculate integer sqrt. It is 2 times faster than this table or the FPU's sqrtss.

Looking for an efficient integer square root algorithm for ARM Thumb2

13 Answers13

Linked