7

Here is function that I call many times per second:

static inline double calculate_scale(double n) { //n may be int or double
    return sqrt(n) - sqrt(n-1);
}

Called in loop like:

for(double i = 0; i < x; i++) {
    double scale = calculate_scale(i);
    ...
}

And it's so slow. What is the best way to optimize this function to get as accurate output as possible?

Parameter n: Starting from 1 up, practically not limited, but mainly used with small numbers in range 1-10. It's integer (whole number), but it may be both int or double, depending on what performs better.

ttdado
  • 164
  • 1
  • 1
  • 10
  • _Everything_ is divisible by one without a remainder. Also, if you're using standard library `sqrt`, its parameter is always a floating point, so even if you pass in an integer, it will be converted accordingly. – ForceBru Feb 01 '17 at 18:03
  • 3
    I assume by "is divisible by `1` without remainder" you mean "will be an integer, whether it is passed in as an `int` or a `double`"? – R_Kapp Feb 01 '17 at 18:04
  • How slow is it? What's wrong with precomputed table? I would put an static std::map cached_result in the function. – greedy52 Feb 01 '17 at 18:05
  • Currently I am passing double into it, because I know converting `int double` takes time. But I just wanted to say, the optimized version parameter may be `double` or `int`, whatever will perform better, – ttdado Feb 01 '17 at 18:09
  • If the square roots are to be the nearest positive integer values, their difference is always either 0 or 1. – Weather Vane Feb 01 '17 at 18:12
  • 1
    Yes, input will be like `1, 2, 3 ...` or `1.0, 2.0, 3.0, ...`, just like integers. Probably may help you I am using it in classic loop `for(double i = 0; i < x; i++) { ... }`. However precomputed tables are not possible because of unlimited range. – ttdado Feb 01 '17 at 18:14
  • 10
    If you have a loop incrementing one by one, you could just save last `sqrt(n)` in a variable and use it again, couldn’t you ? – Ulysse BN Feb 01 '17 at 18:17
  • 3
    If you are using gcc, try the `-fno-math-errno` flag and check if performance improves substantially. – Matteo Italia Feb 01 '17 at 18:18
  • 2
    @ttdado Remember, you can always [edit] your post to make it clearer. This will ultimately benefit *you* by helping people make better answers. – anatolyg Feb 01 '17 at 18:20
  • Ulysse BN thanks, I really didn't think about that. such primitive thing... I was using it all the time as function and never got such idea. Many thanks, this will work perfectly. – ttdado Feb 01 '17 at 18:27
  • Glad to help! As said @anatolyg, think about editing your question with what you told us in comments for latter users. – Ulysse BN Feb 01 '17 at 18:32
  • Many platforms have1 cycle square root and inverse square root estimate functions that are good for 4 or 5 digit places. It might be worth looking for intrinsics just in case. – Michael Dorgan Feb 01 '17 at 18:33
  • 3
    Also if in a loop, you should be able to get some benefit from SIMD stuff. – Michael Dorgan Feb 01 '17 at 18:35
  • Michael Dorgan thanks for response. It's code for ARM Aarch64 (cpu cortex a53), and when the version will be fine, I would like to rewrite it in assembly. I got limited knowledge, and I am not sure how to get benefits you speak about from that platform. – ttdado Feb 01 '17 at 18:53
  • I'm not familiar with the ARM architecture. Does this processor have hardware floating point or is an emulator library used? – Bob Jarvis - Reinstate Monica Feb 01 '17 at 18:59
  • The [**Fast Inverse Square Root**](https://en.wikipedia.org/wiki/Fast_inverse_square_root) can often be used to provide amazingly fast and accurate approximations in these cases. (after all, it is how Quake was able to run blistering fast on ancient hardware...) – David C. Rankin Feb 03 '17 at 20:38
  • Thanks for copying that link from my answer :-) – ttdado Feb 03 '17 at 20:39
  • Why the hell would you think I copied the link from your answer? It actually came from my ***Basket Notepad***... – David C. Rankin Feb 03 '17 at 20:40
  • Heh. But do you think it's still worth using it on modern cpus? Because it doesn't looks fast for me when I compare that many instructions to signle fsqrt instruction. – ttdado Feb 03 '17 at 20:41

3 Answers3

4

You can try to replace it with the following approximation

sqrt(n) - sqrt(n-1) == 
(sqrt(n) - sqrt(n-1)) * (sqrt(n) + sqrt(n-1)) / (sqrt(n) + sqrt(n-1)) ==
(n - (n + 1)) / (sqrt(n) + sqrt(n-1)) ==
1 / (sqrt(n) + sqrt(n-1))

For large enough n, the last equation is pretty close to 1 / (2 * sqrt(n)). So you only have to call sqrt once. It's also worth noting that even without the approximation, the last expression is more numerically stable in terms of relative error for larger n.

StoryTeller - Unslander Monica
  • 148,497
  • 21
  • 320
  • 399
  • 6
    If we're just going to approximate `sqrt(n - 1) == sqrt(n)` for large n, we should just simplify the original function to `return 0;`. ;) – GManNickG Feb 01 '17 at 18:10
  • 2
    @GManNickG - Yeah, well :) I realized the note about the numerical stability was warranted after that suggestion, if nothing else. – StoryTeller - Unslander Monica Feb 01 '17 at 18:11
  • Well I found out approximation myself something like `0.5/sqrt(n)` or `0.5/sqrt(n-1)`, but is not good for small numbers. As numbers are bigger, precision is better. But problem is even the range is not limited, this function is mainly called with small numbers in range 1-10. – ttdado Feb 01 '17 at 18:18
  • 1
    @ttdado - Well, than the suggestion by **Ulysse BN** should work well for you. You don't need a precomputed table, just cache the last values(s) you computed. – StoryTeller - Unslander Monica Feb 01 '17 at 18:20
  • 4
    For large `n`, a much better approximation is `0.5 / sqrt(n - 0.5)`. (This should also be faster because it removes a multiplication operation.) – r3mainer Feb 01 '17 at 18:28
  • @squeamishossifrage - true, and the single operand to `sqrt` is as close to both original operands instead of just to one. – StoryTeller - Unslander Monica Feb 01 '17 at 18:30
3

First of all, thanks for all suggestions. I've done some research and found some interesting implementations and facts.

1. In Loop or Using Precomputed table

(thanks @Ulysse BN) You can optimize loop by simply saving previous sqrt(n) value. Following example demonstrates this optimization used to setup precomputed table.

    /**
     * Init variables
     *      i       counter
     *      x       number of cycles (size of table)
     *      sqrtI1  previous square root = sqrt(i-1)
     *      ptr     Pointer for next value
     */
    double i, x = sizeof(precomputed_table) / sizeof(double);
    double sqrtI1 = 0;

    double* ptr = (double*) precomputed_table;

    /**
     * Optimized calculation
     * In short:
     *      scale = sqrt(i) - sqrt(i-1)
     */
    for(i = 1; i <= x; i++) {
        double sqrtI = sqrt(i);
        double scale = sqrtI - sqrtI1; 
        *ptr++ = scale;
        sqrtI1 = sqrtI;
    }

Using precomputed table is probably the fastest method, but it's drawback may be that it's size is limited.

static inline double calculate_scale(int n) {
    return precomputed_table[n-1];
}


2. Approximation For BIG numbers using Inverse Square Root

Required Inverse (reciprocal) Square Root function rsqrt

This method has most accurate results with big numbers. With small numbers there are errors:

1    2     3      10       100     1000
0.29 0.006 0.0016 0.000056 1.58e-7 4.95e-10

Here is JS code that I used to calculate results above:

function sqrt(x) { return Math.sqrt(x); } function d(x) { return (sqrt(x)-sqrt(x-1))-(0.5/sqrt(x-0.5));} console.log(d(1), d(2), d(3), d(10), d(100), d(1000));

You can also see accuracy compared with two-sqrt version in single graph: https://www.google.com/search?q=(sqrt(x)-sqrt(x-1))-(0.5%2Fsqrt(x-0.5))

Usage:

static inline double calculate_scale(double n) {
    //Same as: 0.5 / sqrt(n-0.5)
    //but lot faster
    return 0.5 * rsqrt(n-0.5);
}

On some older cpus (with slow or no hardware square root) you may go even faster using floats and Fast inverse square root from Quake:

static inline float calculate_scale(float n) {
    return 0.5 * Q_rsqrt(n-0.5);
}

float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck? 
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}

For more info about implementation, see https://en.wikipedia.org/wiki/Fast_inverse_square_root and http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf . Not recommended to use on modern cpus with hardware reciprocal square root.

Not always solution: 0.5 / sqrt(n-0.5)

Please note that on some processors (eg. ARM Cortex A9, Intel Core2) division takes nearly same time as hardware square root, so it's best to use original function with 2 square roots sqrt(n) - sqrt(n-1) OR reciprocal square root with multiply instead 0.5 * rsqrt(n-0.5) if exist.


3. Using Precomputed table with fallback

This method is good compromise between first 2 solutions. It has both good accuracy and performance.

static inline double calculate_scale(double n) {
    if(n <= sizeof_precomputed_table) {
        int nIndex = (int) n;
        return precomputed_table[nIndex-1];
    }
    //Multiply + Inverse Square root
    return 0.5 * rsqrt(n-0.5);
    //OR
    return sqrt(n) - sqrt(n-1);
}

In my case I need really accurate numbers, so my precomputed table size is 2048.

Any feedback is welcomed.

Community
  • 1
  • 1
ttdado
  • 164
  • 1
  • 1
  • 10
2

You stated that n is mainly a number smaller than 10. You could possibly use a precomputed table for numbers smaller than 10, or even more since it's cheap, and fallback to real calculations in case of larger numbers.

The code would look something like:

static inline double calculate_scale(double n) { //n may be int or double
    if (n <= 10.0 && n == floor(n)) {
        return precomputed[(int) n]
    }
    return sqrt(n) - sqrt(n-1);
}
Mateo Hrastnik
  • 473
  • 6
  • 18