
I'm currently working on an approximation of the cosine. Since the final target device is an in-house development with a 32-bit floating-point ALU/LU and there is a specialized C compiler, I am not able to use the C library math functions (cosf, ...). I'm aiming to code various methods that differ in terms of accuracy and number of instructions/cycles.

I've already tried a lot of different approximation algorithms, starting from fdlibm, Taylor expansion, Padé approximation, the Remez algorithm using Maple, and so on...

But as soon as I implement them using only float precision, there is a significant loss of precision. And rest assured: I know that with double precision, a much higher precision is no problem at all...

Right now, I have some approximations which are exact up to a few thousand ULP around pi/2 (the range where the largest errors occur), and I feel that I am limited by the single-precision conversions.

To address the topic of argument reduction: the input is in radians. I assume that an argument reduction will cause even more precision loss due to divisions/multiplications... Since my overall input range is only 0..pi, I decided to reduce the argument to 0..pi/2.

Therefore my question is: Does anybody know a single-precision approximation to the cosine function with high accuracy (and, in the best case, high efficiency)? Are there any algorithms that optimize approximations for single precision? Do you know whether the built-in cosf function computes its values with single or double precision internally? Here is my current code:

#include <stdint.h>

float ua_cos_v2(float x)
{
    float output;
    float myPi = 3.1415927410125732421875f;
    if (x < 0) x = -x;
    int quad = (int32_t)(x * 0.63661977236f); // quad = x/(pi/2) = x*2/pi
    if (x < 1.58f && x > 1.57f) // special-case the region around pi/2
    {
        float d = x - 1.57079637050628662109375f; // x minus pi/2 rounded to float
        float d2 = d * d;
        output = -d - 2.0e-12f * d2 + 0.16666667163372039794921875f * d * d2
            + 2.0e-13f * d2 * d2
            + 0.000198412701138295233249664306640625f * d * d2 * d2 * d2;
        output -= 4.37E-08f; // account for the low part of pi/2
    }
    else {
        float param_x;
        int param_quad = -1;
        switch (quad)
        {
        case 0:
            param_x = x;
            break;
        case 1:
            param_x = myPi - x;
            param_quad = 1;
            break;
        case 2:
            param_x = x - myPi;
            break;
        default: // case 3; also keeps param_x initialized for out-of-range input
            param_x = 2 * myPi - x;
            break;
        }
        float c1 = 1.0f,
            c2 = -0.5f,
            c3 = 0.0416666679084300994873046875f,
            c4 = -0.001388888922519981861114501953125f,
            c5 = 0.00002480158218531869351863861083984375f,
            c6 = -2.75569362884198199026286602020263671875E-7f,
            c7 = 2.08583283978214240050874650478363037109375E-9f,
            c8 = -1.10807162057025010426514199934899806976318359375E-11f;
        float _x2 = param_x * param_x;
        output = c1 + _x2 * (c2 + _x2 * (c3 + _x2 * (c4 + _x2 * (c5 + _x2 * (c6 + _x2 * (c7 + _x2 * c8))))));
        if (param_quad == 1 || param_quad == 0)
            output = -output;
    }
    return output;
}


If I have forgotten any information, please do not hesitate to ask!

Thanks in advance

Dexter S
  • What precision do you need? Please show the code for the approximation with insufficient precision. Maybe there are ways to improve the precision. (We can't tell without seeing the code.) Please [edit] your question to add this information, don't use comments to answer. – Bodo Sep 16 '20 at 11:39
  • Is your actual input in radians, or do you actually want to calculate `cos(x*pi)` for `0<=x<=1`? In any case, before applying any sort of polynomial approximation, you should reduce the input range to `[-pi/4, pi/4]` and use identities such as `cos(x+pi/2) = -sin(x)`. – chtz Sep 16 '20 at 11:41
  • For x near π/2, which is where you say the highest error is, cos(x) is near π/2−x. This means that approximating it with a polynomial is easy. Specifically, you should use y=π/2−x and then a specific polynomial for this case is y, but even a more general polynomial of some form like y+c3•y^3+c5•y^5+… will have little calculation error as the high-order terms will be effectively zero. Where the error occurs is in calculating y=π/2−x. If this is done with high precision, y is a result accurate within a fraction of an ULP near π/2. If done with `float` precision, the error is huge. – Eric Postpischil Sep 16 '20 at 12:58
  • For this specific case, you might consider storing π/2 in two parts. The first, p0, is π/2 rounded to a `float`. The second, p1, is π/2−p0 (precalculated, with the result written into the source code). Then π/2−x can be calculated accurately in `float` precision with `p0-x+p1`. When x is the `float` nearest π/2, this has an error about ⅓ ULP. Beyond that, we would need to see the code you are using. – Eric Postpischil Sep 16 '20 at 13:01
  • Thanks for your quick answers. I hope I addressed all the remaining information. @EricPostpischil: your approach might be worth a try, especially with the pi/2 stored in two floats... – Dexter S Sep 16 '20 at 13:26
  • Have you already tried [CORDIC](https://en.wikipedia.org/wiki/CORDIC)? – Bob__ Sep 16 '20 at 13:28
  • @Bob__ Yes and no. I read a lot about CORDIC; I think that even fdlibm [link](https://www.netlib.org/fdlibm/) might use CORDIC algorithms at its core, and I haven't found a good implementation for single precision in C yet... – Dexter S Sep 16 '20 at 15:08
  • @EricPostpischil Just tried your suggestion, which indeed works very well using the following code: `int b = 0x3fc90fdb; //pi/2 float p0 = *((float*)&b); float p1 = -4.371139000186242830836e-8f; float cos = p0 - x + p1; `, at least for very small intervals like 1.5707f .. 1.5709f. Do you have any other ideas for approximations to use on the rest of the interval from, let's say, 1.4 to 1.9? – Dexter S Sep 16 '20 at 15:16
  • "with high accuracy (and in the best case high efficiency)" --> which is more important? I would suggest _fastest_ as long as the accuracy is <= N ULP, where N is your choice in the [1...2] range. – chux - Reinstate Monica Sep 16 '20 at 17:50
  • Do you need sine()? Often best to code a helper version of `helper_sin(sub_range_x)` and `helper_cos()` to solve `my_cos(wider_x_range)`? – chux - Reinstate Monica Sep 16 '20 at 18:00
  • If you are looking for speed, it can be faster to code `cosd(some_degree_x)` rather than `cos()` as the argument reduction is easier to do rapidly and accurately than starting with radians. – chux - Reinstate Monica Sep 16 '20 at 18:15
  • @DexterS Does your target hardware provide a fused multiply-add operation (FMA), or just floating-point multiply and add? – njuffa Sep 17 '20 at 02:07
  • Accuracy is at first more important; since we have only 1024 instructions to store in IMEM, efficiency might limit accuracy. Input has to be in radians... – Dexter S Sep 17 '20 at 08:09

2 Answers


It is certainly possible to compute cosine on [0, π] with any desired error bound >= 0.5 ulp using just native precision operations. However, the closer the target is to a correctly rounded function, the more up-front design work and computational work at runtime is required.

Implementations of transcendental functions typically consist of argument reduction, core approximation(s), and a final fixup to counteract the argument reduction. In cases where the argument reduction involves subtraction, catastrophic cancellation needs to be avoided by explicitly or implicitly using higher precision. Implicit techniques can be designed to rely just on native-precision computation, for example by splitting a constant like π/2 into an unevaluated sum such as 1.57079637e+0f - 4.37113883e-8f when using IEEE-754 binary32 (single precision).
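
As a minimal sketch (following the split-constant recipe that also appears in the comments below; the variable names are placeholders of my choosing), such a split can be precomputed offline in double precision:

#include <stdio.h>

int main (void)
{
    /* split pi/2 into an unevaluated sum hi + lo (Cody/Waite style): hi is
       pi/2 rounded to float, lo captures the residual lost in that rounding */
    double half_pi = 1.5707963267948966;
    float hi = (float)half_pi;
    float lo = (float)(half_pi - (double)hi);
    printf ("hi = % 16.8e  lo = % 16.8e\n", hi, lo);
    return 0;
}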

Achieving high accuracy with native-precision computation is a lot easier when the hardware provides a fused multiply-add (FMA) operation. OP did not specify whether their target platform provides this operation, so I will first show a very simple approach offering moderate accuracy (maximum error < 5 ulps) that relies just on multiplies and adds. I am assuming hardware that adheres to the IEEE-754 standard and that float is mapped to the IEEE-754 binary32 format.

The following is based on a blog post by Colin Wallace titled "Approximating sin(x) to 5 ULP with Chebyshev polynomials", which is not available online at the time of writing. I originally retrieved it here, and Google presently retains a cached copy here. They propose to approximate sine on [-π, π] by using a polynomial in x² of sin(x)/(x*(x²-π²)), then multiplying this by x*(x²-π²). A standard trick to compute a²-b² more accurately is to rewrite it as (a-b) * (a+b). Representing π as an unevaluated sum of two floating-point numbers pi_hi and pi_lo avoids catastrophic cancellation during subtraction, which turns the computation x²-π² into ((x - pi_hi) - pi_lo) * ((x + pi_hi) + pi_lo).
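
As a compilable sketch of that last expression (the helper name `xx_minus_pipi` is hypothetical; the constants are π rounded to float plus the precomputed residual, i.e. twice the π/2 split shown above):

/* sketch: evaluate x*x - pi*pi as (x - pi) * (x + pi), with pi held as the
   unevaluated sum pi_hi + pi_lo to avoid catastrophic cancellation */
float xx_minus_pipi (float x)
{
    const float pi_hi =  3.14159274e+0f; //  0x1.921fb6p+1
    const float pi_lo = -8.74227766e-8f; // -0x1.777a5cp-24
    return ((x - pi_hi) - pi_lo) * ((x + pi_hi) + pi_lo);
}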

The polynomial core approximation should ideally be a minimax approximation, which minimizes the maximum error. I have done so here. Various standard tools like Maple or Mathematica can be used for this, or one can create one's own code based on the Remez algorithm.

For a cosine computation on [0, PI] we can make use of the fact that cos (t) = sin (π/2 - t). Substituting x = (π/2 - t) into x * (x - π) * (x + π) yields (π/2 - t) * (3π/2 - t) * (-π/2 - t). The constants can be split into high and low parts (or head and tail, to use another common idiom) as before.

/* Approximate cosine on [0, PI] with maximum error of 4.704174 ulp */
float cosine (float x)
{
    const float half_pi_hi       =  1.57079637e+0f; //  0x1.921fb6p+0
    const float half_pi_lo       = -4.37113883e-8f; // -0x1.777a5cp-25
    const float three_half_pi_hi =  4.71238899e+0f; //  0x1.2d97c8p+2
    const float three_half_pi_lo = -1.19248806e-8f; // -0x1.99bc5cp-27
    float p, s, hpmx, thpmx, nhpmx;

    /* cos(x) = sin (pi/2 - x) = sin (hpmx) */
    hpmx = (half_pi_hi - x) + half_pi_lo;               // pi/2-x
    thpmx = (three_half_pi_hi - x) + three_half_pi_lo;  // 3*pi/2 - x
    nhpmx = (-half_pi_hi - x) - half_pi_lo;             // -pi/2 - x

    /* P(hpmx*hpmx) ~= sin (hpmx) / (hpmx * (hpmx * hpmx - pi * pi)) */
    s = hpmx * hpmx;
    p =         1.32729383e-10f;
    p = p * s - 2.33177868e-8f;
    p = p * s + 2.52223435e-6f;
    p = p * s - 1.73503853e-4f;
    p = p * s + 6.62087463e-3f;
    p = p * s - 1.01321176e-1f;
    return hpmx * nhpmx * thpmx * p;
}

Below I am showing a classical approach which first reduces the argument into [-π/4, π/4] while recording the quadrant. The quadrant then tells us whether we need to compute a polynomial approximation to the sine or the cosine on this primary approximation interval, and whether we need to flip the sign of the final result. This code assumes that the target platform supports the FMA operation specified by IEEE-754, and that it is mapped via the standard C function fmaf() for single precision.

The code is straightforward except for the float-to-int conversion with rounding mode to-nearest-or-even that is used to compute the quadrant. It is performed by the "magic number addition" method and combined with the multiplication by 2/π (equivalent to division by π/2). The maximum error is less than 1.5 ulps.
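
To illustrate, the rounding trick in isolation might look like this minimal sketch (the helper name `magic_round` is mine; it assumes IEEE-754 binary32 arithmetic in round-to-nearest-even mode and an argument magnitude well below 2^22):

/* round to the nearest integer, ties to even, without an int cast: adding
   1.5 * 2**23 pushes the fraction bits out of the significand, so the
   addition itself performs the rounding; the subtraction restores the
   magnitude */
float magic_round (float r)
{
    return (r + 12582912.0f) - 12582912.0f; // 12582912 = 1.5 * 2**23
}

In my_cosf() below, the addition of the magic constant is fused with the multiplication by 2/π via fmaf(), and the constant is subtracted afterwards.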

/* compute cosine on [0, PI] with maximum error of 1.429027 ulp */
float my_cosf (float a)
{
    const float half_pi_hi =  1.57079637e+0f; //  0x1.921fb6p+0
    const float half_pi_lo = -4.37113883e-8f; // -0x1.777a5cp-25
    float c, j, r, s, sa, t;
    int i;

    /* subtract closest multiple of pi/2 giving reduced argument and quadrant */
    j = fmaf (a, 6.36619747e-1f, 12582912.f) - 12582912.f; // 2/pi, 1.5 * 2**23
    a = fmaf (j, -half_pi_hi, a);
    a = fmaf (j, -half_pi_lo, a);

    /* phase shift of pi/2 (one quadrant) for cosine */
    i = (int)j;
    i = i + 1;

    sa = a * a;
    /* Approximate cosine on [-PI/4,+PI/4] with maximum error of 0.87444 ulp */
    c =               2.44677067e-5f;  //  0x1.9a8000p-16
    c = fmaf (c, sa, -1.38877297e-3f); // -0x1.6c0efap-10
    c = fmaf (c, sa,  4.16666567e-2f); //  0x1.555550p-5
    c = fmaf (c, sa, -5.00000000e-1f); // -0x1.000000p-1
    c = fmaf (c, sa,  1.00000000e+0f); //  0x1.000000p+0
    /* Approximate sine on [-PI/4,+PI/4] with maximum error of 0.64196 ulp */
    s =               2.86567956e-6f;  //  0x1.80a000p-19
    s = fmaf (s, sa, -1.98559923e-4f); // -0x1.a0690cp-13
    s = fmaf (s, sa,  8.33338592e-3f); //  0x1.111182p-7
    s = fmaf (s, sa, -1.66666672e-1f); // -0x1.555556p-3
    t = a * sa;
    s = fmaf (s, t, a);

    /* select sine approximation or cosine approximation based on quadrant */
    r = (i & 1) ? c : s;
    /* adjust sign based on quadrant */
    r = (i & 2) ? (0.0f - r) : r;

    return r;
}

As it turns out, in this particular case the use of FMA provides only a tiny benefit in terms of accuracy. If I replace calls to fmaf(a,b,c) with ((a)*(b)+(c)), the maximum error increases minimally to 1.451367 ulps, that is, it stays below 1.5 ulps.
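
For reference, that replacement can be expressed as a macro, as also shown in the comment exchange below; note this is an experiment-only device, since it rounds twice and shadows a standard function name:

/* experimental fallback when no hardware FMA is available: a plain
   multiply-add that rounds after the multiply and again after the add,
   unlike a true fused multiply-add */
#define fmaf(a,b,c) ((a)*(b)+(c))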

njuffa
  • Thanks a lot! I actually have to check whether the hardware provides an FMA. Your implementation is using the same idea ericpostpischil mentioned. I'm actually quite impressed how these few lines of code produce such a result. Very interesting and yet so simple. I have already tested your first approach; it works fine with about 30 instructions, which is really nice. I am now going to implement your second solution. The tiny benefit of FMA fits well since I don't have the standard math library available... Is there a specific reason for the parentheses around each value (a), (b)? – Dexter S Sep 17 '20 at 08:41
  • @DexterS Re parentheses: Just cut and paste from the macro definition: `#define fmaf(a,b,c) ((a)*(b)+(c))`. The techniques of unevaluated sums and split constants to increase effective precision in intermediate computation have been around since the 1970s and predate the start of my professional work. Original authors are Kahan, Dekker, Cody/Waite. – njuffa Sep 17 '20 at 08:48
  • Just tested your second approach; it works totally fine. I will accept your answer, thank you! If I do not want to use macros, I could hardcode it as, for example, ``j = ((a)* (6.36619747e-1f) +(12582912.f)) - 12582912.f; ``, right? – Dexter S Sep 17 '20 at 08:52
  • @DexterS Absolutely. I would *not* recommend using macros to override standard C functions in production code. Since you are a new asker I will mention that it is considered good form to wait at least 24 hours before accepting an answer to give answer providers from all time zones an equal opportunity to contribute. – njuffa Sep 17 '20 at 08:57
  • Thanks, okay, I'll wait with accepting the answer. I have two more questions, just for me to understand the algorithm: You mentioned that one could easily compute the minimax approximation using, for example, Maple's minimax. So the idea is always to first get the algorithm with double precision, and in a second step transform the equation to single precision by splitting up the constants? And how does ``((a)*(b)+(c))`` differ from ``(a*b+c)``? – Dexter S Sep 17 '20 at 10:57
  • @DexterS In macro expansions, it is a best practice to surround all instances of macro arguments with parentheses to avoid nasty surprises due to operator precedence. The Remez algorithm is a one-hundred-year-old algorithm for computing minimax approximations (usually polynomials, but extension to rational functions is possible). Maple, Mathematica, and the free Sollya tool have built-in facilities to generate such minimax approximations, e.g. Sollya's `fpminimax` command. Splitting constants: e.g. `double pi =3.14159265358979323; float pi_hi = (float)pi; float pi_lo = (float)(pi- (double)pi_hi);` – njuffa Sep 17 '20 at 11:06
  • @njuffa Okay. I have a few questions derived from this topic, concerning max ULP calculation, argument reduction, transfer to sine... What's the best practice to address these? It seems like the comments are not the right way to do so... – Dexter S Sep 17 '20 at 12:04
  • @DexterS Correct, we have been abusing comments here already. This is not a site for tutorials or discussions about a topic. This site is for specific on-topic questions that have specific answers based in facts, not opinions. – njuffa Sep 17 '20 at 17:59
  • Re “It is certainly possible to compute cosine on [0, π] with any desired error bound >= 0.5 ulp using just native precision operations”: Technically true; cosine is computable, but it may be useful to qualify this if the “native precision” it is intended to apply to is any precision somebody might design a machine for. I am not sure to what extent humans have proven it can be computed with a known fixed bound on execution time. I think Crlibm did it for binary32 and binary64, but not much more? So higher precision implementations may require a loop with unknown bound. – Eric Postpischil Sep 18 '20 at 02:56
  • @njuffa I'm still very happy with your solution. I want to understand the first algorithm in detail, and having read your explanation several times now, I still don't understand the whole calculation: I understand the split of pi/2, and the transfer from x to pi/2-x for cosine. I assume that the calculation of p is the result of the minimax approximation you mentioned. But where does the last line come from? Since the original source is still not available, I would be grateful if you could explain the algorithm a little bit more. I also think that other users might benefit from your knowledge. – Dexter S Sep 22 '20 at 16:52

I see @njuffa has a good approach, yet I want to pose another approach, given that:

  • The angle is likely originally in degrees, not radians; take advantage of that.
  • It does not depend on float being IEEE.
  • fma() may be weak, so it is not used.

Perform range reduction using integer math, then find the answer via a self-adjusting Taylor series.

#include <assert.h>

static float my_sinf_helper(float xx, float term, unsigned n) {
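  // stop once the term is negligible relative to 1.0 at float precision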
  if (term + 1.0f == 1.0f) {
    return term;
  }
  return term - my_sinf_helper(xx, xx * term / ((n + 1) * (n + 2)), n + 2);
}

static float my_cosf_helper(float xx, float term, unsigned n) {
  if (term + 1.0f == 1.0f) {
    return term;
  }
  return term - xx * my_cosf_helper(xx, term / ((n + 1) * (n + 2)), n + 2);
}

// valid for [-pi/4, +pi/4]
static float my_sinf_primary(float x) {
  return x * my_sinf_helper(x * x, 1.0, 1);
}

// valid for [-pi/4, +pi/4]
static float my_cosf_primary(float x) {
  return my_cosf_helper(x * x, 1.0, 0);
}

#define MY_PIf 3.1415926535897932384626433832795f
#define D2Rf(d) ((d)*(MY_PIf/180))

float my_cosdf(float x) {
  if (x < 0) {x = -x;}
  unsigned long long ux = (unsigned long long) x; // whole degrees
  x -= (float) ux;                                // fractional degrees in [0, 1)
  unsigned ux_primary = ux % 360u;                // reduce to one 360-degree period
  int uxq = ux_primary % 90;                      // offset within the quadrant
  if (uxq >= 45) uxq -= 90;                       // center on a multiple of 90 degrees
  x += uxq;                                       // reduced angle, about [-45, +46) degrees
  switch (ux_primary/45) {
    case 7: //
    case 0: return my_cosf_primary(D2Rf(x));
    case 1: //
    case 2: return -my_sinf_primary(D2Rf(x));
    case 3: //
    case 4: return -my_cosf_primary(D2Rf(x));
    case 5: //
    case 6: return my_sinf_primary(D2Rf(x));
  }
  assert(0);
  return 0;
}

Test code

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define DBL_FMT "%+24.17e"

typedef struct {
  double x, y0, y1, adiff;
  unsigned n;
} test;

test worst = {0};

int my_cosd_test(float x) {
  test t;
  t.x = x;
  t.y0 = cos(x*acos(-1)/180);
  t.y1 = my_cosdf(x);
  t.adiff = fabs(t.y1 - t.y0);
  if (t.adiff > worst.adiff) {
    t.n = worst.n + 1;
    printf("n:%3u x:" DBL_FMT " y0:" DBL_FMT " y1:" DBL_FMT " d:" DBL_FMT "\n", //
        t.n, t.x, t.y0, t.y1, t.adiff);
    fflush(stdout);
    worst = t;
    if (t.n > 100)
      exit(-1);
  }
  return t.adiff != 0.0;
}

float rand_float_finite(void) {
  union {
    float f;
    unsigned char uc[sizeof(float)];
  } u;
  do {
    for (size_t i = 0; i < sizeof u.uc / sizeof u.uc[0]; i++) {
      u.uc[i] = (unsigned char) rand();
    }
  } while (!isfinite(u.f) || fabs(u.f) > 5000);
  return u.f;
}

int my_cosd_tests(unsigned n) {
  my_cosd_test(0.0);
  for (unsigned i = 0; i < n; i++) {
    my_cosd_test(rand_float_finite());
  }
  return 0;
}

int main(void) {
  my_cosd_tests(1000000);
}

Worst case error: +8.2e-08. Maximum recursion depth observed: 6.

n: 14 x:+3.64442993164062500e+03 y0:+7.14107074054115110e-01 y1:+7.14107155799865723e-01 d:+8.17457506130381262e-08

I'll review more later. I do see more extensive testing reaching about 9e-08 worst-case error and some TBD issue with x greater than about 1e10.

chux - Reinstate Monica