46

I have been googling this question for the past hour, but all I find are pointers to the Taylor series or sample code that is either too slow or does not compile at all. Most answers I've found via Google amount to "Google it, it's already been asked", but sadly it isn't...

I am profiling my game on a low-end Pentium 4 and found that ~85% of execution time is spent calculating sine, cosine and square root (from the standard C++ library in Visual Studio), and this seems to be heavily CPU dependent (on my i7 the same functions take only 5% of execution time, and the game is waaaaaaaaaay faster). I cannot optimize these three functions away, nor calculate both sine and cosine in one pass (they are interdependent), but I don't need very accurate results for my simulation, so I can live with a faster approximation.

So, the question: what is the fastest way to calculate sine, cosine and square root for a float in C++?

EDIT: Lookup tables are more painful, as the resulting cache misses are way more costly on a modern CPU than evaluating a Taylor series. CPUs are just so fast these days, and cache is not.

I made a mistake: I thought I needed to calculate several factorials for the Taylor series, and I see now they can be implemented as constants.

So the updated question: is there any speedy optimization for square root as well?

EDIT2

I am using the square root to calculate distance, not for normalization, so I can't use the fast inverse square root algorithm (as pointed out in a comment: http://en.wikipedia.org/wiki/Fast_inverse_square_root).

EDIT3

I also cannot operate on squared distances; I need the exact distance for my calculations.

Chris Britt
PiotrK
  • It was solved many years ago - use precomputed table to get the sine/cosine numbers. – SChepurin Sep 06 '13 at 16:25
  • 2
    http://stackoverflow.com/questions/3688649/create-sine-lookup-table-in-c – drescherjm Sep 06 '13 at 16:26
  • 1
    For the *inverse* square root (which is common, since it is involved in vector normalization), there is a well known formula (http://en.wikipedia.org/wiki/Fast_inverse_square_root), but honestly it is a bit outdated, and probably 1.0/sqrt(x) will enable some compiler optimization. – sbabbi Sep 06 '13 at 16:26
  • 4
    have a look at this for sine and cosine: http://devmaster.net/forums/topic/4648-fast-and-accurate-sinecosine/ – user2366842 Sep 06 '13 at 16:26
  • @SChepurin It's not the fastest way anymore - CPUs are waaaay faster now than years ago and cache is not that much speedier. I mean a cache miss is much more painful than calculating sine using a Taylor series – PiotrK Sep 06 '13 at 16:27
  • What exactly is wrong with the Taylor Series? It sounds like exactly what you need. It allows you to calculate sin, cos, tan up to whatever accuracy you want. If you have trouble implementing it, then post that as a separate question. Alternatively, people have suggested a lookup table which can be very good, but lookups can be costly too. Fortunately, implementation is so fast you can do tests. – rliu Sep 06 '13 at 16:28
  • You really need to explain more about what purpose you are using the results for. There are a variety of different approaches with different tradeoffs, and depending on what you are doing you may also be able to avoid the operations entirely, use SIMD, or amortise their cost across many calculations. It is impossible to advise on microoptimisation without knowing the details of the specific code to be microoptimised. – moonshadow Sep 06 '13 at 16:29
  • @PiotrK - To get a number from look-up table is slower than calculating Taylor series? And any profiling results to prove this? – SChepurin Sep 06 '13 at 16:35
  • 3
    How about detecting the CPU and using a native instruction on a modern processor, with a lookup table or other optimized code on older machines. – drescherjm Sep 06 '13 at 16:39
  • @SChepurin I am quite sure I've seen benchmark comparison articles, but can't find them right now – PiotrK Sep 06 '13 at 16:40
  • If you're using the distance calculation for comparison with another distance, just work with distance squared and you can do without the sqrt entirely. – Mark Ransom Sep 06 '13 at 16:43
  • As I ask above, can you explain why you are calculating the distance? e.g. if you are calculating it in order to compare it with something, don't do the square root - square the value to be compared with instead. Similarly, if you explain what you are doing with sin/cos there may be ways to avoid those operations as well. – moonshadow Sep 06 '13 at 16:44
  • @MarkRansom I need exact distance, that squared distance trick cannot be used in my case – PiotrK Sep 06 '13 at 16:44
  • The usual approach for square root is to obtain an estimate for square root (or inverse square root) using a native CPU instruction, a lookup table, Carmack's function etc, typically to 1 part in 64 or so, then use it in Newton-Raphson until you have sufficient precision (IIRC 4 rounds for full float precision). This is actually what the C++ library will be doing, so if you need an exact result chances are the implementation you are using now is already optimal and the only ways to improve would be to use SIMD (SSE, etc) or change your algorithm to avoid the operation / reduce its frequency. – moonshadow Sep 06 '13 at 16:49
  • 3
    When asking this kind of question, you need to specify your conditions much more precisely. Do you have any information on the distribution of the numbers for which you'll have to compute sin/cos/sqrt (say they are all close to 0)? Do you have specific constraints on the precision (say, does sin(0) absolutely have to be 0)? etc. Any extra information gives a way to improve the solution. – Marc Glisse Sep 06 '13 at 17:51
  • Look up tables for trig functions are rarely a good idea on any processor from the last 15 years or so. What is your minimum spec? If your minimum spec is a Pentium 4 consider enabling SSE2 code generation and using SSE sqrt intrinsics. The runtime will link to SSE optimized versions of trig functions as well which may be enough of a speedup that you don't need to look further. – mattnewport Oct 30 '13 at 19:18
  • If accuracy is of no concern whatsoever, you can use 0 to approximate sine, as its output will be in the range [-1, 1]. Huhu -- sorry, joke -- just speed and accuracy tend to be kind of tied together, and it's kind of ambiguous as to what degree of approximation is acceptable. LUTs typically aren't so beneficial here. I've found those bit twiddling magic number solutions like id's fast rsqrt to still provide some small benefit, although I'm in the C++03 era (legacy/platform) and quite a bit behind on the latest optimizers/standard libs. There are sin/cos variants of these as well. –  Nov 18 '15 at 06:08
  • A very old question here, but just so you know, if `T(n)` is the nth term of your Taylor for sine, then `T(n) = T(n-1)*x*x/(2*n*(2*n+1))` which means that a no factorial sine taylor approximation is (you have to decide the term to truncate at before hand) `float res = 1; for (int i = n;i>0;--i) res = 1 - res*x*x/(2*i*(2*i+1)); res *= x;` – Varad Mahashabde Dec 04 '19 at 14:26
  • Taylor series are awful. See https://stackoverflow.com/q/345085/44330 and https://stackoverflow.com/a/394512/44330 – Jason S Mar 30 '20 at 13:44

17 Answers

80

Here's the guaranteed fastest possible sine function in C++:

double FastSin(double x)
{
    return 0;
}

Oh, you wanted better accuracy than |1.0|? Well, here is a sine function that is similarly fast:

double FastSin(double x)
{
    return x;
}

This answer actually does not suck when x is close to zero. For small x, sin(x) is approximately equal to x, because x is the first term of the Taylor expansion of sin(x).

What, still not accurate enough for you? Well read on.

Engineers in the 1970s made some fantastic discoveries in this field, but new programmers are simply unaware that these methods exist, because they're not taught as part of standard computer science curricula.

You need to start by understanding that there is no "perfect" implementation of these functions for all applications. Therefore, superficial answers to questions like "which one is fastest" are guaranteed to be wrong.

Most people who ask this question don't understand the importance of the tradeoffs between performance and accuracy. In particular, you are going to have to make some choices regarding the accuracy of the calculations before you do anything else. How much error can you tolerate in the result? 10^-4? 10^-16?

Unless you can quantify the error in any method, don't use it. See all those random answers below mine, that post a bunch of random uncommented source code, without clearly documenting the algorithm used and its exact maximum error across the input range? "The error is approximately sort of mumble mumble I would guess." That's strictly bush league. If you don't know how to calculate the PRECISE maximum error, to FULL precision, in your approximation function, across the ENTIRE range of the inputs... then you don't know how to write an approximation function!

No one uses the Taylor series alone to approximate transcendentals in software. Except for certain highly specific cases, Taylor series generally approach the target slowly across common input ranges.

The algorithms that your grandparents used to calculate transcendentals efficiently are collectively referred to as CORDIC and were simple enough to be implemented in hardware. Here is a well documented CORDIC implementation in C. CORDIC implementations usually require a very small lookup table, but most implementations don't even require that a hardware multiplier be available. Most CORDIC implementations let you trade off performance for accuracy, including the one I linked.
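
To make the idea concrete, here is a minimal rotation-mode CORDIC sketch in C++. This is my own illustration, not the implementation linked above: the atan table and the gain K would normally be precomputed constants, and in a fixed-point version the multiplies by powers of two become plain shifts.

#include <cmath>

// Rotation-mode CORDIC for sin/cos, valid for theta roughly in [-pi/2, pi/2].
// 'iterations' is the speed/accuracy knob: each iteration adds about one bit.
void cordic_sincos(double theta, int iterations, double* s, double* c)
{
    double x = 1.0, y = 0.0, z = theta;
    double powerOfTwo = 1.0;   // 2^-i; a right shift in a fixed-point version
    double gain = 1.0;         // accumulated CORDIC gain K (normally a constant)
    for (int i = 0; i < iterations; ++i) {
        const double angle = std::atan(powerOfTwo);  // normally a tiny precomputed table
        const double d = (z >= 0.0) ? 1.0 : -1.0;
        const double xNew = x - d * y * powerOfTwo;
        const double yNew = y + d * x * powerOfTwo;
        x = xNew;
        y = yNew;
        z -= d * angle;
        gain *= std::sqrt(1.0 + powerOfTwo * powerOfTwo);
        powerOfTwo *= 0.5;
    }
    *c = x / gain;   // cos(theta)
    *s = y / gain;   // sin(theta)
}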

There's been a lot of incremental improvements to the original CORDIC algorithms over the years. For example, last year some researchers in Japan published an article on an improved CORDIC with better rotation angles, which reduces the operations required.

If you have hardware multipliers sitting around (and you almost certainly do), or if you can't afford a lookup table like CORDIC requires, you can always use a Chebyshev polynomial to do the same thing. Chebyshev polynomials require multiplies, but this is rarely a problem on modern hardware. We like Chebyshev polynomials because they have highly predictable maximum errors for a given approximation. The maximum of the last term in a Chebyshev polynomial, across your input range, bounds the error in the result. And this error gets smaller as the number of terms gets larger. Here's one example of a Chebyshev polynomial giving a sine approximation across a huge range, ignoring the natural symmetry of the sine function and just solving the approximation problem by throwing more coefficients at it. And here's an example of estimating a sine function to within 5 ULPs. Don't know what an ULP is? You should.

We also like Chebyshev polynomials because the error in the approximation is equally distributed across the range of outputs. If you're writing audio plugins or doing digital signal processing, Chebyshev polynomials give you a cheap and predictable dithering effect "for free."

If you want to find your own Chebyshev polynomial coefficients across a specific range, many math libraries call the process of finding those coefficients "Chebyshev fit" or something like that.
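
For illustration, here is how such a fit is typically evaluated once you have the coefficients: a Clenshaw recurrence over the Chebyshev series, with the input first mapped from your range [a, b] onto [-1, 1]. This is a generic sketch; the coefficients and the c[0] convention depend on whatever fitting routine produced them.

#include <vector>

// Evaluate sum_k c[k] * T_k(t) with the Clenshaw recurrence, t mapped from [a, b] to [-1, 1].
double chebyshev_eval(const std::vector<double>& c, double x, double a, double b)
{
    const double t  = (2.0 * x - a - b) / (b - a);   // map [a, b] -> [-1, 1]
    const double t2 = 2.0 * t;
    double d = 0.0, dd = 0.0;
    for (int k = static_cast<int>(c.size()) - 1; k >= 1; --k) {
        const double save = d;
        d  = t2 * d - dd + c[k];
        dd = save;
    }
    return t * d - dd + c[0];   // some fit routines expect 0.5 * c[0] here; match yours
}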

Square roots, then as now, are usually calculated with some variant of the Newton-Raphson algorithm, usually with a fixed number of iterations. Usually, when someone cranks out an "amazing new" algorithm for doing square roots, it's merely Newton-Raphson in disguise.
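
As a hedged sketch of what that looks like, here is plain Newton-Raphson for the square root, seeded with a crude exponent-halving bit trick; the iteration count is the speed/accuracy knob (two iterations already give several good digits, and library-quality code uses a better seed and a fixed count).

#include <cstdint>
#include <cstring>

float nr_sqrt(float x, int iterations)
{
    if (x <= 0.0f) return 0.0f;

    // Crude seed: halve the exponent via the bit pattern (memcpy avoids aliasing UB).
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);
    i = (i >> 1) + (127u << 22);
    float guess;
    std::memcpy(&guess, &i, sizeof guess);

    // Newton-Raphson on f(g) = g*g - x:  g <- 0.5 * (g + x / g)
    for (int n = 0; n < iterations; ++n)
        guess = 0.5f * (guess + x / guess);
    return guess;
}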

Newton-Raphson, CORDIC, and Chebyshev polynomials let you trade off speed for accuracy, so the answer can be just as imprecise as you want.

Lastly, when you've finished all your fancy benchmarking and micro-optimization, make sure that your "fast" version is actually faster than the library version. Here is a typical library implementation of fsin() bounded on the domain from -pi/4 to pi/4. And it just ain't that damned slow.

One last caution to you: you're almost certainly using IEEE-754 math to perform your estimations, and anytime you're performing IEEE-754 math with a bunch of multiplies, then some obscure engineering decisions made decades ago will come back to haunt you, in the form of roundoff errors. And those errors start small, but they get bigger, and Bigger, and BIGGER! At some point in your life, please read "What every computer scientist should know about floating point numbers" and have the appropriate amount of fear. Keep in mind that if you start writing your own transcendental functions, you'll need to benchmark and measure the ACTUAL error due to floating-point roundoff, not just the maximum theoretical error. This is not a theoretical concern; "fast math" compilation settings have bit me in the butt, on more than one project.
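
In that spirit, here is a trivial harness sketch for measuring the actual maximum error of an approximation against the library function over a range; the callable passed in is whatever approximation you are testing.

#include <algorithm>
#include <cmath>
#include <cstdio>

template <typename F>
double max_abs_error(F approx, double lo, double hi, int samples)
{
    double worst = 0.0;
    for (int i = 0; i <= samples; ++i) {
        const double x = lo + (hi - lo) * i / samples;
        worst = std::max(worst, std::fabs(approx(x) - std::sin(x)));
    }
    return worst;
}

// e.g. std::printf("%g\n", max_abs_error([](double x) { return x; }, -0.1, 0.1, 100000));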

tl;dr: go google "sine approximation" or "cosine approximation" or "square root approximation" or "approximation theory."

johnwbyrd
  • 2
    For float/double, most platforms have efficient hardware sqrt. On x86, hardware sqrt is faster than anything you could cook up yourself, except using the hardware fast approximate reciprocal sqrt instruction. I guess without a hardware FPU, or if it has very slow sqrt but fast multiply, NR could be a win. – Peter Cordes May 18 '17 at 21:00
  • 2
    The x86 hardware itself is doing a Newton-Raphson iterative approximation. – johnwbyrd Jun 30 '17 at 02:41
  • 2
    It doesn't matter how the hardware is wired up; all that matters is how fast it is relative to an FP multiply (or fused multiply-add). The `sqrt` instruction is a black box that spits out correctly-rounded sqrt results *extremely* fast (e.g. on Skylake with 12 cycle latency, one per 3 cycle throughput). You can't beat that with a Newton-Raphson iteration starting with `rsqrtps` (approximate reciprocal sqrt). Using *just* `rsqrtps` (giving 12-bit precision) is faster, or if you need the sqrt instead of the reciprocal, `x * approx_rsqrt(x)` is somewhat faster than `sqrt(x)`. – Peter Cordes Jun 30 '17 at 02:51
  • 2
    Unless you bottleneck on uop throughput rather than sqrt latency, in which case using plain `sqrtps` is faster even than `rsqrtps` + `fmaddps`, because `sqrtps` decodes to a single uop (the table-lookup + Newton-Raphson happens inside the divider unit, not driven by microcode which would compete with other instructions for execution resources). – Peter Cordes Jun 30 '17 at 02:55
  • You're an x86 expert and your opinion is correct from that perspective. My code has to run on every platform imaginable, so I tend to think algorithmically before pulling up the assembler. "The best optimizer is between your ears" -Abrash – johnwbyrd Nov 04 '17 at 22:32
  • 1
    It's definitely a luxury to manually vectorize for only one specific ISA where all the HW you care about has efficient sqrt. I don't know what HW sqrt is like on ARM, or high-end PowerPC (used in a few HPC clusters). It's definitely important to have good algorithms, but when the constant factors matter it's important to understand the relative costs and know when an approximation is actually slower. `sqrt` is special because it's one of the functions that IEEE requires produce a correctly-rounded result, along with + - * /, so it is usually implemented in hardware. – Peter Cordes Nov 05 '17 at 03:00
  • If the ARM in question has NEON capabilities, then it can do sqrt. Thumb and Thumb2 have no built-in sqrt. No idea about PPC. – johnwbyrd Nov 10 '17 at 08:19
  • 1
    Ok, but the question is how fast that hardware sqrt is. P5 (pentium 586) has x87 `fsqrt`, but it takes 70 clock cycles, vs. fadd / fmul being 3 cycle latency. (P5 is an in-order CPU, but it can overlap fsqrt with integer instructions for 69 of those clock cycles, but only by 2 of the 70 cycles with other FP work). A fast approximate sqrt would definitely be worth considering on P5. – Peter Cordes Nov 10 '17 at 16:43
  • 1
    For "make sure that your "fast" version is actually faster than the library version", +1 – fishinear Nov 27 '18 at 16:12
  • 4
    If you declare `FastSine` as `constexpr`, it will be even faster. – davidhigh Feb 06 '19 at 12:41
  • Good answer, although I am a bit sceptical about numerical recipes. Accuracy is never discussed quantitatively if I remember correctly. Do you know of any more 'scientific' book about this topic – lalala Nov 19 '19 at 12:01
48

First, the Taylor series is NOT the best/fastest way to implement sine/cos. It is also not the way professional libraries implement these trigonometric functions, and knowing the best numerical implementation allows you to tweak the accuracy to get speed more efficiently. In addition, this problem has already been extensively discussed on StackOverflow. Here is just one example.

Second, the big difference you see between old and new PCs is due to the fact that modern Intel architectures have dedicated assembly instructions to calculate elementary trigonometric functions. It is quite hard to beat them on execution speed.

Finally, let's talk about the code on your old PC. Check the GSL (GNU Scientific Library) implementation (or Numerical Recipes), and you will see that they basically use the Chebyshev approximation formula.

Chebyshev approximation converges faster, so you need to evaluate fewer terms. I won't write implementation details here because there are already very nice answers posted on StackOverflow. Check this one, for example. Just tweak the number of terms in the series to change the balance between accuracy and speed.

For this kind of problem, if you want implementation details of some special function or numerical method, you should take a look at the GSL code before any further action - GSL is THE STANDARD numerical library.

EDIT: you may improve the execution time by enabling aggressive floating point optimization flags in gcc/icc. This will decrease the precision, but it seems that is exactly what you want.

EDIT2: You can try to build a coarse sin grid and use the GSL routine gsl_interp_cspline_periodic (a spline with periodic boundary conditions) to spline that table (the spline reduces the error compared to linear interpolation => you need fewer points in your table => fewer cache misses)!
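
As a rough sketch of that EDIT2 idea (the grid size and the resulting accuracy here are illustrative assumptions, not measurements), the GSL calls look like this:

#include <cmath>
#include <cstdio>
#include <gsl/gsl_spline.h>

int main()
{
    const int N = 64;                           // coarse grid over one period
    double xs[N + 1], ys[N + 1];
    for (int i = 0; i <= N; ++i) {
        xs[i] = 2.0 * M_PI * i / N;
        ys[i] = std::sin(xs[i]);
    }
    ys[N] = ys[0];                              // periodic spline needs matching endpoints

    gsl_interp_accel* acc = gsl_interp_accel_alloc();
    gsl_spline* spline = gsl_spline_alloc(gsl_interp_cspline_periodic, N + 1);
    gsl_spline_init(spline, xs, ys, N + 1);

    const double x = 1.2345;                    // argument already reduced to [0, 2*pi]
    std::printf("spline %f  libm %f\n", gsl_spline_eval(spline, x, acc), std::sin(x));

    gsl_spline_free(spline);
    gsl_interp_accel_free(acc);
}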

James Oswald
Vivian Miranda
24

Based on the idea of http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648 and some manual rewriting to improve the performance in a micro benchmark, I ended up with the following cosine implementation, which is used in an HPC physics simulation that is bottlenecked by repeated cos calls on a large number space. It's accurate enough and much faster than a lookup table, and most notably no division is required.

#include <cmath>  // std::floor, std::abs; M_PI may need _USE_MATH_DEFINES with MSVC

template<typename T>
inline T cos(T x) noexcept
{
    constexpr T tp = 1./(2.*M_PI);  // maps one full period onto a unit interval
    x *= tp;
    x -= T(.25) + std::floor(x + T(.25));
    x *= T(16.) * (std::abs(x) - T(.5));
    #if EXTRA_PRECISION
    x += T(.225) * x * (std::abs(x) - T(1.));
    #endif
    return x;
}

The Intel compiler at least is also smart enough in vectorizing this function when used in a loop.

If EXTRA_PRECISION is defined, the maximum error is about 0.00109 for the range -π to π, assuming T is double as it's usually defined in most C++ implementations. Otherwise, the maximum error is about 0.056 for the same range.

Peter O.
milianw
  • there's a division in the first line – Hexo Jul 06 '16 at 19:33
  • 15
    Yes, but that is a compile-time constant division which is infinitely cheap at runtime :P – milianw Jul 07 '16 at 08:36
  • 1
    I'd like to see a benchmark against standard library cosine. https://stackoverflow.com/questions/824118/why-is-floor-so-slow – johnwbyrd Nov 05 '17 at 04:12
  • @johnwbyrd I don't have access to a license of the Intel compiler anymore, where the difference was the biggest. Reason being as I said that it managed to vectorize the above function and surrounding code, whereas it didn't do that as nicely for std::cos. The link to "floor is slow" also shows how `-ffast-math` helps alleviate the issue somewhat. ICC does this by default. – milianw Nov 06 '17 at 11:36
  • 5
    For all those other visual learners out there that wanna see how this black mathgic works: https://www.desmos.com/calculator/cbuhbme355 – Chet Jun 27 '20 at 06:05
  • @Chet thanks for that, very informative! I did something similar back then using gnuplot to get a feeling for the deviation. This interactive graph is much more intuitive, thanks! – milianw Jul 02 '20 at 12:41
  • "accurate enough" – johnwbyrd Aug 10 '20 at 02:17
  • Tried it out, am impressed. What amazes me is that it's accurate ('enough') way outside the range of -π to π. How is that even possible? – Wim Rijnders Feb 27 '21 at 08:49
  • On previous comment, tried out @Chet's link and that helped understanding it. Thanks! Will be using this for GPU programming. – Wim Rijnders Feb 27 '21 at 09:09
  • @WimRijnders afaik most GPUs have one-cycle sin/cos instructions - there's probably no need for this optimization when you are doing GPU programming. – milianw Feb 28 '21 at 12:43
  • @milianw Not my GPU, I assure you. I checked vainly again and again. FYI, the VideoCore GPU on Raspberry Pi's. – Wim Rijnders Feb 28 '21 at 17:00
  • I confirm that milianw's solution (converted for sine instead of cosine) is about 5 times faster than Java's built-in `Math.sin(x)` while Nick's original answer (upon which milianw based his solution) is only about 2 times faster. – Mike Nakis Mar 30 '21 at 19:03
23

For square root, there is an approach based on bit-shifting the float representation.

An IEEE-754 float stores a sign bit, a biased exponent (the power of two) and a mantissa. Taking a square root halves the logarithm of a number, so halving the exponent gives a rough approximation.

float squareRoot(float x)
{
  // Reinterpret the float's bits as an integer (a memcpy would sidestep the
  // strict-aliasing issue of this pointer cast).
  unsigned int i = *(unsigned int*) &x;

  // Adding the bias and shifting right halves the biased exponent:
  // the net effect is i = (i >> 1) + (127 << 22).
  i  += 127 << 23;
  i >>= 1;

  return *(float*) &i;
}

That gives a constant-time, rough approximation of the square root; one or two Newton-Raphson iterations can refine it if needed.

BigTailWolf
21

The fastest way is to pre-compute values and use a table like in this example:

Create sine lookup table in C++

BUT if you insist upon computing at runtime you can use the Taylor series expansion of sine or cosine...

Taylor series of sine: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

For more on the Taylor series... http://en.wikipedia.org/wiki/Taylor_series

One of the keys to getting this to work well is pre-computing the factorials and truncating at a sensible number of terms. The factorials grow in the denominator very quickly, so you don't need to carry more than a few terms.

Also, don't recompute x^n from scratch for each term; e.g. compute x^5 by multiplying x^3 by x^2, reusing the previous power each time.
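
To make that concrete, here is a hedged sketch of a truncated sine series with the reciprocal factorials folded into constants and the powers built up incrementally (Horner form), so no factorial or pow() is evaluated at runtime. It is only reasonable once the argument has been reduced to roughly [-pi, pi].

inline float taylor_sin(float x)
{
    const float x2 = x * x;
    // coefficients are -1/3!, 1/5!, -1/7!, 1/9!
    return x * (1.0f + x2 * (-1.0f / 6.0f
              + x2 * ( 1.0f / 120.0f
              + x2 * (-1.0f / 5040.0f
              + x2 * ( 1.0f / 362880.0f)))));
}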

KeatsKelleher
  • And as for square root? – PiotrK Sep 06 '13 at 16:29
  • 1
    He said at the start of his post that he doesn't like Taylor series but didn't explain why. – rliu Sep 06 '13 at 16:29
  • 4
    @roliu My mistake, I thought I had to calculate the factorial several times, but I missed that I can use precomputed constants – PiotrK Sep 06 '13 at 16:31
  • +1 for precomputing factorials, that should save a bunch of cpu cycles by itself. – user2366842 Sep 06 '13 at 16:38
  • 2
    There's an interesting link in the question comments that shows a method even better than the Taylor series. – Mark Ransom Sep 06 '13 at 16:41
  • 18
    This is not the best way to calculate sin/cos in terms of efficiency. There are old answers in stackoverflow that had already discussed this in great detail. Also - GSL GNU scientific library, which is the standard numerical library used everywhere, also does not use that. Knowing the best numerical procedure allows you to balance accuracy/speed more precisely. – Vivian Miranda Sep 07 '13 at 00:06
  • Taylor series with 3-4 iterations is measurably faster than a lookup table nowadays, although it's not the fastest possibility (that would be a parabolic approximation, I think the code was on Flipcode some 10 years or so ago...). – Damon Jan 15 '15 at 16:51
  • The Maclaurin series (Taylor series is a more general class of expansions, of which the famous Maclaurin series are a subset) is an efficient and precise tool for education and for proving theorems. It is not an efficient or precise tool for calculation. This really shouldn't be the accepted answer, as there are much more useful ones below – tobi_s Mar 29 '21 at 01:38
6

For x86, the hardware FP square root instructions are fast (sqrtss is sqrt Scalar Single-precision). Single precision is faster than double-precision, so definitely use float instead of double for code where you can afford to use less precision.

For 32bit code, you usually need compiler options to get it to do FP math with SSE instructions, rather than x87. (e.g. -mfpmath=sse)

To get C's sqrt() or sqrtf() functions to inline as just sqrtsd or sqrtss, you need to compile with -fno-math-errno. Having math functions set errno on NaN is generally considered a design mistake, but the standard requires it. Without that option, gcc inlines it but then does a compare+branch to see if the result was NaN, and if so calls the library function so it can set errno. If your program doesn't check errno after math functions, there is no danger in using -fno-math-errno.

You don't need any of the "unsafe" parts of -ffast-math to get sqrt and some other functions to inline better or at all, but -ffast-math can make a big difference (e.g. allowing the compiler to auto-vectorize in cases where that changes the result, because FP math isn't associative).

e.g. with gcc6.3 compiling float foo(float a){ return sqrtf(a); }

foo:    # with -O3 -fno-math-errno.
    sqrtss  xmm0, xmm0
    ret

foo:   # with just -O3
    pxor    xmm2, xmm2   # clang just checks for NaN, instead of comparing against zero.
    sqrtss  xmm1, xmm0
    ucomiss xmm2, xmm0
    ja      .L8          # take the slow path if 0.0 > a
    movaps  xmm0, xmm1
    ret

.L8:                     # errno-setting path
    sub     rsp, 24
    movss   DWORD PTR [rsp+12], xmm1   # store the sqrtss result because the x86-64 SysV ABI has no call-preserved xmm regs.
    call    sqrtf                      # call sqrtf just to set errno
    movss   xmm1, DWORD PTR [rsp+12]
    add     rsp, 24
    movaps  xmm0, xmm1    # extra mov because gcc reloaded into the wrong register.
    ret

gcc's code for the NaN case seems way over-complicated; it doesn't even use the sqrtf return value! Anyway, this is the kind of mess you actually get without -fno-math-errno, for every sqrtf() call site in your program. Mostly it's just code-bloat, and none of the .L8 block will ever run when taking the sqrt of a number >= 0.0, but there's still several extra instructions in the fast path.


If you know that your input to sqrt is non-zero, you can use the fast but very approximate reciprocal sqrt instruction, rsqrtps (or rsqrtss for the scalar version). One Newton-Raphson iteration brings it up to nearly the same precision as the hardware single-precision sqrt instruction, but not quite.

sqrt(x) = x * 1/sqrt(x), for x!=0, so you can calculate a sqrt with rsqrt and a multiply. These are both fast, even on P4 (was that still relevant in 2013?).

On P4, it may be worth using rsqrt + Newton iteration to replace a single sqrt, even if you don't need to divide anything by it.

See also an answer I wrote recently about handling zeroes when calculating sqrt(x) as x*rsqrt(x), with a Newton Iteration. I included some discussion of rounding error if you want to convert the FP value to an integer, and links to other relevant questions.
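
As a sketch of the x * rsqrt(x) + Newton idea with SSE intrinsics (assuming x > 0; zero needs the special handling discussed in the linked answer):

#include <immintrin.h>

inline float approx_sqrt(float x)
{
    __m128 vx = _mm_set_ss(x);
    __m128 r  = _mm_rsqrt_ss(vx);                 // ~12-bit approximation of 1/sqrt(x)
    // One Newton-Raphson step: r <- r * (1.5 - 0.5 * x * r * r)
    __m128 rr = _mm_mul_ss(r, r);
    r = _mm_mul_ss(r, _mm_sub_ss(_mm_set_ss(1.5f),
                                 _mm_mul_ss(_mm_mul_ss(_mm_set_ss(0.5f), vx), rr)));
    return _mm_cvtss_f32(_mm_mul_ss(vx, r));      // sqrt(x) = x * (1/sqrt(x))
}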


P4:

  • sqrtss: 23c latency, not pipelined
  • sqrtsd: 38c latency, not pipelined
  • fsqrt (x87): 43c latency, not pipelined
  • rsqrtss / mulss: 4c + 6c latency. Possibly one per 3c throughput, since they apparently don't need the same execution unit (mmx vs. fp).

  • SIMD packed versions are somewhat slower


Skylake:

  • sqrtss/sqrtps: 12c latency, one per 3c throughput
  • sqrtsd/sqrtpd: 15-16c latency, one per 4-6c throughput
  • fsqrt (x87): 14-21c latency, one per 4-7c throughput
  • rsqrtss / mulss: 4c + 4c latency. One per 1c throughput.
  • SIMD 128b vector versions are the same speed. 256b vector versions are a bit higher latency, almost half throughput. The rsqrtss version has full performance for 256b vectors.

With a Newton Iteration, the rsqrt version is not much if at all faster.


Numbers from Agner Fog's experimental testing. See his microarch guides to understand what makes code run fast or slow. Also see links at the tag wiki.

IDK how best to calculate sin/cos. I've read that the hardware fsin / fcos (and the only slightly slower fsincos that does both at once) are not the fastest way, but IDK what is.

Peter Cordes
5

Qt has fast implementations of sine (qFastSin) and cosine (qFastCos) that use a lookup table with interpolation. I'm using them in my code and they are faster than std::sin/cos and precise enough for what I need:

https://code.woboq.org/qt5/qtbase/src/corelib/kernel/qmath.h.html#_Z8qFastSind

#define QT_SINE_TABLE_SIZE 256


inline qreal qFastSin(qreal x)
{
   int si = int(x * (0.5 * QT_SINE_TABLE_SIZE / M_PI)); // Would be more accurate with qRound, but slower.
   qreal d = x - si * (2.0 * M_PI / QT_SINE_TABLE_SIZE);
   int ci = si + QT_SINE_TABLE_SIZE / 4;
   si &= QT_SINE_TABLE_SIZE - 1;
   ci &= QT_SINE_TABLE_SIZE - 1;
   return qt_sine_table[si] + (qt_sine_table[ci] - 0.5 * qt_sine_table[si] * d) * d;
}

inline qreal qFastCos(qreal x)
{
   int ci = int(x * (0.5 * QT_SINE_TABLE_SIZE / M_PI)); // Would be more accurate with qRound, but slower.
   qreal d = x - ci * (2.0 * M_PI / QT_SINE_TABLE_SIZE);
   int si = ci + QT_SINE_TABLE_SIZE / 4;
   si &= QT_SINE_TABLE_SIZE - 1;
   ci &= QT_SINE_TABLE_SIZE - 1;
   return qt_sine_table[si] - (qt_sine_table[ci] + 0.5 * qt_sine_table[si] * d) * d;
}

The LUT and license can be found here: https://code.woboq.org/qt5/qtbase/src/corelib/kernel/qmath.cpp.html#qt_sine_table

This pair of functions takes radian inputs. The LUT covers the entire 2π input range. Each function interpolates away from the nearest table entry using the remainder d, with the entry a quarter period ahead in the same table serving as the derivative (effectively a short Taylor step around the table point).

Adriel Jr
4

Sharing my code: it's a 6th-degree polynomial, nothing special, but rearranged to avoid pow calls. On a Core i7 this is 2.3 times slower than the standard implementation, though a bit faster over the [0..2*PI] range. For an old processor this could be an alternative to the standard sin/cos.

/*
    On [-1000..+1000] range with 0.001 step average error is: +/- 0.000011, max error: +/- 0.000060
    On [-100..+100] range with 0.001 step average error is:   +/- 0.000009, max error: +/- 0.000034
    On [-10..+10] range with 0.001 step average error is:     +/- 0.000009, max error: +/- 0.000030
    Error distribution ensures there's no discontinuity.
*/

const double PI          = 3.141592653589793;
const double HALF_PI     = 1.570796326794897;
const double DOUBLE_PI   = 6.283185307179586;
const double SIN_CURVE_A = 0.0415896;
const double SIN_CURVE_B = 0.00129810625032;

double cos1(double x) {
    if (x < 0) {
        int q = -x / DOUBLE_PI;
        q += 1;
        double y = q * DOUBLE_PI;
        x = -(x - y);
    }
    if (x >= DOUBLE_PI) {
        int q = x / DOUBLE_PI;
        double y = q * DOUBLE_PI;
        x = x - y;
    }
    int s = 1;
    if (x >= PI) {
        s = -1;
        x -= PI;
    }
    if (x > HALF_PI) {
        x = PI - x;
        s = -s;
    }
    double z = x * x;
    double r = z * (z * (SIN_CURVE_A - SIN_CURVE_B * z) - 0.5) + 1.0;
    if (r > 1.0) r = r - 2.0;
    if (s > 0) return r;
    else return -r;
}

double sin1(double x) {
    return cos1(x - HALF_PI);
}
Josh
3

I use the following CORDIC code to compute trigonometric functions in quadruple precision. The constant N determines the number of bits of precision required (for example, N=26 will give single-precision accuracy). Depending on the desired accuracy, the precomputed storage can be small and will fit in the cache. It only requires addition and multiplication operations and is also very easy to vectorize.

The algorithm pre-computes sin and cos values for 0.5^i, i=1,...,N. Then we can combine these precomputed values to compute sin and cos for any angle, up to a resolution of 0.5^N.

template <class QuadReal_t>
QuadReal_t sin(const QuadReal_t a){
  const int N=128;
  static std::vector<QuadReal_t> theta;
  static std::vector<QuadReal_t> sinval;
  static std::vector<QuadReal_t> cosval;
  if(theta.size()==0){
    #pragma omp critical (QUAD_SIN)
    if(theta.size()==0){
      theta.resize(N);
      sinval.resize(N);
      cosval.resize(N);

      QuadReal_t t=1.0;
      for(int i=0;i<N;i++){
        theta[i]=t;
        t=t*0.5;
      }

      sinval[N-1]=theta[N-1];
      cosval[N-1]=1.0-sinval[N-1]*sinval[N-1]/2;
      for(int i=N-2;i>=0;i--){
        sinval[i]=2.0*sinval[i+1]*cosval[i+1];
        cosval[i]=sqrt(1.0-sinval[i]*sinval[i]);
      }
    }
  }

  QuadReal_t t=(a<0.0?-a:a);
  QuadReal_t sval=0.0;
  QuadReal_t cval=1.0;
  for(int i=0;i<N;i++){
    while(theta[i]<=t){
      QuadReal_t sval_=sval*cosval[i]+cval*sinval[i];
      QuadReal_t cval_=cval*cosval[i]-sval*sinval[i];
      sval=sval_;
      cval=cval_;
      t=t-theta[i];
    }
  }
  return (a<0.0?-sval:sval);
}
Mingye Wang
Dhairya
3

This is an implementation of the Taylor series method given previously in akellehe's answer.

unsigned int Math::SIN_LOOP = 15;
unsigned int Math::COS_LOOP = 15;

// sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
template <class T>
T Math::sin(T x)
{
    T Sum       = 0;
    T Power     = x;
    T Sign      = 1;
    const T x2  = x * x;
    T Fact      = 1.0;
    for (unsigned int i=1; i<SIN_LOOP; i+=2)
    {
        Sum     += Sign * Power / Fact;
        Power   *= x2;
        Fact    *= (i + 1) * (i + 2);
        Sign    *= -1.0;
    }
    return Sum;
}

// cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + ...
template <class T>
T Math::cos(T x)
{
    T Sum       = 1;
    T Power     = 1;
    T Sign      = 1.0;
    const T x2  = x * x;
    T Fact      = 1.0;
    for (unsigned int i=2; i<COS_LOOP; i+=2)
    {
        Power   *= x2;
        Fact    *= i * (i - 1);
        Sign    *= -1.0;
        Sum     += Sign * Power / Fact;
    }
    return Sum;
}
hkBattousai
2

So let me rephrase: this idea comes from approximating the cosine and sine functions on the interval [-pi/4, +pi/4] with a bounded error using the Remez algorithm. Then, using the range-reduced float remainder and a LUT of the cosines and sines of the integer quotient, the approximation can be moved to any angular argument.

It's just unique, and I thought it could be expanded on to make a more efficient algorithm in terms of a bounded error.

void sincos_fast(float x, float *pS, float *pC){
    float cosOff4LUT[] = { 0x1.000000p+00,  0x1.6A09E6p-01,  0x0.000000p+00, -0x1.6A09E6p-01, -0x1.000000p+00, -0x1.6A09E6p-01,  0x0.000000p+00,  0x1.6A09E6p-01 };

    int     m, ms, mc;
    float   xI, xR, xR2;
    float   c, s, cy, sy;

    // Cody & Waite's range reduction Algorithm, [-pi/4, pi/4]
    xI  = floorf(x * 0x1.45F306p+00 + 0.5);              // This is 4/pi.
    xR  = (x - xI * 0x1.920000p-01) - xI*0x1.FB5444p-13; // This is pi/4 in two parts per C&W.
    m   = (int) xI;
    xR2 = xR*xR;

    // Find cosine & sine index for angle offsets indices
    mc = (  m  ) & 0x7;     // two's complement permits upper modulus for negative numbers =P
    ms = (m + 6) & 0x7;     // phase correction for sine.

    // Find cosine & sine
    cy = cosOff4LUT[mc];     // Load angle offset neighborhood cosine value 
    sy = cosOff4LUT[ms];     // Load angle offset neighborhood sine value 

    c = 0xf.ff79fp-4 + xR2 * (-0x7.e58e9p-4);               // TOL = 1.2786e-4
    // c = 0xf.ffffdp-4 + xR2 * (-0x7.ffebep-4 + xR2 * 0xa.956a9p-8);  // TOL = 1.7882e-7

    s = xR * (0xf.ffbf7p-4 + xR2 * (-0x2.a41d0cp-4));   // TOL = 4.835251e-6
    // s = xR * (0xf.fffffp-4 + xR2 * (-0x2.aaa65cp-4 + xR2 * 0x2.1ea25p-8));  // TOL = 1.1841e-8

    *pC = c*cy - s*sy;      
    *pS = c*sy + s*cy;
}

float sqrt_fast(float x){
    union {float f; int i; } X, Y;
    float ScOff;
    uint8_t e;

    X.f = x;
    e = (X.i >> 23);           // f.SFPbits.e;

    if(x <= 0) return(0.0f);

    ScOff = ((e & 1) != 0) ? 1.0f : 0x1.6a09e6p0;  // NOTE: If exp=EVEN, b/c (exp-127) a (EVEN - ODD) := ODD; but a (ODD - ODD) := EVEN!!

    e = ((e + 127) >> 1);                            // NOTE: If exp=ODD,  b/c (exp-127) then flr((exp-127)/2)
    X.i = (X.i & ((1uL << 23) - 1)) | (0x7F << 23);  // Mask mantissa, force exponent to zero.
    Y.i = (((uint32_t) e) << 23);

    // Error grows with square root of the exponent. Unfortunately no work around like inverse square root... :(
    // Y.f *= ScOff * (0x9.5f61ap-4 + X.f*(0x6.a09e68p-4));        // Error = +-1.78e-2 * 2^(flr(log2(x)/2))
    // Y.f *= ScOff * (0x7.2181d8p-4 + X.f*(0xa.05406p-4 + X.f*(-0x1.23a14cp-4)));      // Error = +-7.64e-5 * 2^(flr(log2(x)/2))
    // Y.f *= ScOff * (0x5.f10e7p-4 + X.f*(0xc.8f2p-4 +X.f*(-0x2.e41a4cp-4 + X.f*(0x6.441e6p-8))));     // Error =  8.21e-5 * 2^(flr(log2(x)/2))
    // Y.f *= ScOff * (0x5.32eb88p-4 + X.f*(0xe.abbf5p-4 + X.f*(-0x5.18ee2p-4 + X.f*(0x1.655efp-4 + X.f*(-0x2.b11518p-8)))));   // Error = +-9.92e-6 * 2^(flr(log2(x)/2))
    // Y.f *= ScOff * (0x4.adde5p-4 + X.f*(0x1.08448cp0 + X.f*(-0x7.ae1248p-4 + X.f*(0x3.2cf7a8p-4 + X.f*(-0xc.5c1e2p-8 + X.f*(0x1.4b6dp-8))))));   // Error = +-1.38e-6 * 2^(flr(log2(x)/2))
    // Y.f *= ScOff * (0x4.4a17fp-4 + X.f*(0x1.22d44p0 + X.f*(-0xa.972e8p-4 + X.f*(0x5.dd53fp-4 + X.f*(-0x2.273c08p-4 + X.f*(0x7.466cb8p-8 + X.f*(-0xa.ac00ep-12)))))));    // Error = +-2.9e-7 * 2^(flr(log2(x)/2))
    Y.f *= ScOff * (0x3.fbb3e8p-4 + X.f*(0x1.3b2a3cp0 + X.f*(-0xd.cbb39p-4 + X.f*(0x9.9444ep-4 + X.f*(-0x4.b5ea38p-4 + X.f*(0x1.802f9ep-4 + X.f*(-0x4.6f0adp-8 + X.f*(0x5.c24a28p-12 ))))))));   // Error = +-2.7e-6 * 2^(flr(log2(x)/2))

    return(Y.f);
}

The longer (commented-out) expressions are slower but more precise. The polynomials are written per Horner's rule.

Mingye Wang
nimig18
  • 1
    This does not explain how you computed the constants (what is the algorithm used), there is an error on x2 that should read xR2, and it's computed error is much much larger than the other answers. Also, I've benchmarked it and it twice slower than milianw answers. The comments doesn't make any sense (the last commented line stated that the error is lower than the uncommented code, why?) – xryl669 Feb 06 '19 at 18:28
  • Very good! It's `12%` slower than glibc's `sincosf`. Close! – étale-cohomology Feb 02 '21 at 05:41
  • @étale-cohomology ok where is the answer key? and yes sorry for not fully benchmarking this, I thought this was fairly unique during the time of posting. – nimig18 Feb 03 '21 at 06:11
  • @xryl669 Constants were solved adhoc after remez algorithm in sollya. Millianw's answer has worse error but sure twice the speed, go for it. Great observation its hitting the limit of the accuracy of the 32-bit float from poor rounding on my part. sorry bud – nimig18 Feb 03 '21 at 06:19
  • @niming18 why are you apologizing? this is good code! And an awesome algorithm! – étale-cohomology Feb 03 '21 at 18:58
2

This is a sine implementation that should be quite fast. It works like this:

it has an arithmetical implementation of square-rooting complex numbers

from complex-number math you know that the angle is halved when a complex number is square-rooted

you can take a complex number whose angle you already know (e.g. i, which has angle 90 degrees or PI / 2 radians)

then by repeatedly square-rooting it you can get complex numbers of the form cos(90 / 2^n) + i sin(90 / 2^n)

from complex-number math you know that when two numbers are multiplied their angles add up

you can express the number k (the one you get as an argument to sin or cos) as a sum of angles 90 / 2^n and then get the resulting value by multiplying the complex numbers you precomputed

the result will be of the form cos k + i sin k

#define PI 3.14159265
#define complex pair <float, float>

/* this is the square root function: it halves the float's exponent for an initial guess, then refines it with a few Newton iterations */

float sqrt(float a) {

    float b = a;

    int *x = (int*) (&b); // here I get an integer pointer to float b, which allows me to directly change bits of the float representation

    int c = ((*x >> 23) & 255) - 127; // here I extract the exponent (floats are like scientific notation 1.111010101... * 2^n)

    if(c < 0) c = -((-c) >> 1); // ---
                                //   |--> This is for halving the exponent
    else c >>= 1;               // ---

    *x &= ~(255 << 23); // here the bits reserved for the exponent are cleared

    *x |= (c + 127) << 23; // here the new exponent is put in place

    for(int i = 0; i < 5; i++) b = (b + a / b) / 2; // here the Newton (Babylonian) square root iteration runs 5 times (I assume even 2 or 3 would be enough)

    return b;
}

/* this is a square root for complex numbers (I derived it on paper), you'll need it later */

complex croot(complex x) {

    float c = x.first, d = x.second;

    return make_pair(sqrt((c + sqrt(c * c + d * d)) / 2), sqrt((-c + sqrt(c * c + d * d)) / 2) * (d < 0 ? -1 : 1));
}

/* this is for multiplying complex numbers, you'll also need it later */

complex mul(complex x, complex y) {

    float a = x.first, b = x.second, c = y.first, d = y.second;

    return make_pair(a * c - b * d, a * d + b * c);
}

/* this function calculates both sine and cosine */

complex roots[24];

float angles[24];

void init() {

    complex c = make_pair(-1, 0); // first number is going to be -1

    float alpha = PI; // angle of -1 is PI

    for(int i = 0; i < 24; i++) {

        roots[i] = c; // save current c

        angles[i] = alpha; // save current angle

        c = croot(c); // root c

        alpha *= 0.5; // halve alpha
    }
}

complex cosin(float k) {

    complex r = make_pair(1, 0); // at start 1

    for(int i = 0; i < 24; i++) {

        if(k >= angles[i]) { // if current k is bigger than angle of c

            k -= angles[i]; // reduce k by that number

            r = mul(r, roots[i]); // multiply the result by c
        }
    }

    return r; // here you'll have a complex number equal to cos k + i sin k.
}

float sin(float k) {

    return cosin(k).second;
}

float cos(float k) {

    return cosin(k).first;
}

Now, if you still find it slow, you can reduce the number of iterations in the function cosin (note that the precision will be reduced too).

Fran x1024
  • johnwbyrd hahah good point, I'll change that. EDIT: Although in my code PI doesn't bear much weight; the only reason it's there is that it's usual for trig functions to use radians as input – Fran x1024 Apr 16 '20 at 23:43
2

An approximation for the sine function that preserves the derivatives at multiples of 90 degrees is given by the formula below. The derivation is similar to Bhaskara I's sine approximation formula, but the constraints set the values and derivatives at 0, 90, and 180 degrees to those of the sine function. You can use this if you need the function to be smooth everywhere.

#define PI 3.141592653589793

double fast_sin(double x) {
    x /= 2 * PI;
    x -= (int) x;

    if (x <= 0.5) {
        double t = 2 * x * (2 * x - 1);
        return (PI * t) / ((PI - 4) * t - 1);
    }
    else {
        double t = 2 * (1 - x) * (1 - 2 * x);
        return -(PI * t) / ((PI - 4) * t - 1);
    }
}

double fast_cos(double x) {
    return fast_sin(x + 0.5 * PI);
}

As for its speed, it at least outperforms the std::sin() function by an average of 0.3 microseconds per call. And the maximum absolute error is 0.0051.

zvxayr
1

I tried milianw's answer and it gave me a 4.5x speedup, so that's awesome.

However, the original article that milianw links to computes a sine, not a cosine, and it does it somewhat differently. (It looks simpler.)

Predictably, after 15 years the URL of that article (http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648) gives a 404 today, so I fetched it via archive.org and I am adding it here for posterity.

Unfortunately, even though the article contains several images, only the first 2 were stored by archive.org. Also, the profile page of the author (http://forum.devmaster.net/users/Nick) was not stored, so I guess we will never know who Nick is.

==================================================

Fast and accurate sine/cosine

Nick Apr '06

Hi all,

In some situations you need a good approximation of sine and cosine that runs at very high performance. One example is to implement dynamic subdivision of round surfaces, comparable to those in Quake 3. Or to implement wave motion, in case there are no vertex shaders 2.0 available.

The standard C sinf() and cosf() functions are terribly slow and offer much more precision than what we actually need. What we really want is an approximation that offers the best compromise between precision and performance. The most well known approximation method is to use a Taylor series about 0 (also known as a Maclaurin series), which for sine becomes:

x - 1/6 x^3 + 1/120 x^5 - 1/5040 x^7 + ...

When we plot this we get: taylor.gif.


The green line is the real sine, the red line is the first four terms of the Taylor series. This seems like an acceptable approximation, but let's have a closer look: taylor_zoom.gif.


It behaves very nicely up till pi/2, but after that it quickly deviates. At pi, it evaluates to -0.075 instead of 0. Using this for wave simulation will result in jerky motion which is unacceptable.

We could add another term, which in fact reduces the error significantly, but this makes the formula pretty lengthy. We already need 7 multiplications and 3 additions for the 4-term version. The Taylor series just can't give us the precision and performance we're looking for.

We did however note that we need sine(pi) = 0. And there's another thing we can see from taylor_zoom.gif: this looks very much like a parabola! So let's try to find the formula of a parabola that matches it as closely as possible. The generic formula for a parabola is A + B x + C x^2. So this gives us three degrees of freedom. Obvious choices are that we want sine(0) = 0, sine(pi/2) = 1 and sine(pi) = 0. This gives us the following three equations:

A + B 0 + C 0^2 = 0
A + B pi/2 + C (pi/2)^2 = 1
A + B pi + C pi^2 = 0

Which has as solution A = 0, B = 4/pi, C = -4/pi^2. So our parabola approximation becomes 4/pi x - 4/pi^2 x^2. Plotting this we get: parabola.gif. This looks worse than the 4-term Taylor series, right? Wrong! The maximum absolute error is 0.056. Furthermore, this approximation will give us smooth wave motion, and can be calculated in only 3 multiplications and 1 addition!

Unfortunately it's not very practical yet. This is what we get in the [-pi, pi] range: negative.gif. It's quite obvious we want at least a full period. But it's also clear that it's just another parabola, mirrored around the origin. The formula for it is 4/pi x + 4/pi^2 x^2. So the straightforward (pseudo-C) solution is:

if(x > 0)
{
    y = 4/pi x - 4/pi^2 x^2;
}
else
{
    y = 4/pi x + 4/pi^2 x^2;
}

Adding a branch is not a good idea though. It makes the code significantly slower. But look at how similar the two parts really are. A subtraction becomes an addition depending on the sign of x. In a naive first attempt to eliminate the branch we can 'extract' the sign of x using x / abs(x). The division is very expensive but look at the resulting formula: 4/pi x - x / abs(x) 4/pi^2 x^2. By inverting the division we can simplify this to a very nice and clean 4/pi x - 4/pi^2 x abs(x). So for just one extra operation we get both halves of our sine approximation! Here's the graph of this formula confirming the result: abs.gif.

Now let's look at cosine. Basic trigonometry tells us that cosine(x) = sine(pi/2 + x). Is that all, add pi/2 to x? No, we actually get the unwanted part of the parabola again: shift_sine.gif. What we need to do is 'wrap around' when x > pi/2. This can be done by subtracting 2 pi. So the code becomes:

x += pi/2;

if(x > pi)   // Original x > pi/2
{
    x -= 2 * pi;   // Wrap: cos(x) = cos(x - 2 pi)
}

y = sine(x);

Yet another branch. To eliminate it, we can use binary logic, like this:

x -= (x > pi) & (2 * pi);

Note that this isn't valid C code at all. But it should clarify how this can work. When x > pi is false the & operation zeroes the right hand part so the subtraction does nothing, which is perfectly equivalent. I'll leave it as an exercise to the reader to create working C code for this (or just read on). Clearly the cosine requires a few more operations than the sine but there doesn't seem to be any other way and it's still extremely fast.

Now, a maximum error of 0.056 is nice but clearly the 4-term Taylor series still has a smaller error on average. Recall what our sine looked like: abs.gif. So isn't there anything we can do to further increase precision at minimal cost? Of course the current version is already applicable for many situations where something that looks like a sine is just as good as the real sine. But for other situations that's just not good enough.

Looking at the graphs you'll notice that our approximation always overestimates the real sine, except at 0, pi/2 and pi. So what we need is to 'scale it down' without touching these important points. The solution is to use the squared parabola, which looks like this: squared.gif. Note how it preserves those important points but it's always lower than the real sine. So we can use a weighted average of the two to get a better approximation:

Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2

With Q + P = 1. You can use exact minimization of either the absolute or relative error but I'll save you the math. The optimal weights are Q = 0.775, P = 0.225 for the absolute error and Q = 0.782, P = 0.218 for the relative error. I will use the former. The resulting graph is: average.gif. Where did the red line go? It's almost entirely covered by the green line, which instantly shows how good this approximation really is. The maximum error is about 0.001, a 50x improvement! The formula looks lengthy but the part between parenthesis is the same value from the parabola, which needs to be computed only once. In fact only 2 extra multiplications and 2 additions are required to achieve this precision boost.

It shouldn't come as a big surprise that to make it work for negative x as well we need a second abs() operation. The final C code for the sine becomes:

float sine(float x)
{
    const float B = 4/pi;
    const float C = -4/(pi*pi);

    float y = B * x + C * x * abs(x);

    #ifdef EXTRA_PRECISION
    //  const float Q = 0.775;
        const float P = 0.225;

        y = P * (y * abs(y) - y) + y;   // Q * y + P * y * abs(y)
    #endif

    return y;
}

So we need merely 5 multiplications and 3 additions; still faster than the 4-term Taylor if we neglect the abs(), and much more precise! The cosine version just needs the extra shift and wrap operations on x.

Last but not least, I wouldn't be Nick if I didn't include the SIMD optimized assembly version. It allows to perform the wrap operation very efficiently so I'll give you the cosine:

// cos(x) = sin(x + pi/2)
addps xmm0, PI_2
movaps xmm1, xmm0
cmpnltps xmm1, PI
andps xmm1, PIx2
subps xmm0, xmm1

// Parabola
movaps xmm1, xmm0
andps xmm1, abs
mulps xmm1, xmm0
mulps xmm0, B
mulps xmm1, C
addps xmm0, xmm1

// Extra precision
movaps xmm1, xmm0
andps xmm1, abs
mulps xmm1, xmm0
subps xmm1, xmm0
mulps xmm1, P
addps xmm0, xmm1

This code computes four cosines in parallel, resulting in a peak performance of about 9 clock cycles per cosine for most CPU architectures. Sines take only 6 clock cycles ideally. The lower precision version can even run at 3 clock cycles per sine... And let's not forget that all input between -pi and pi is valid and the formula is exact at -pi, -pi/2, 0, pi/2 and pi.

So, the conclusion is don't ever again use a Taylor series to approximate a sine or cosine! To add a useful discussion to this article, I'd love to hear if anyone knows good approximations for other transcendental functions like the exponential, logarithm and power functions.

Cheers,

Nick

==================================================

You might also find the comments that follow this article interesting, by visiting the web archive page:

http://web.archive.org/web/20141220225551/http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648

Mike Nakis
0

Just use the FPU with inline x86 for Wintel apps. The direct CPU sqrt instruction is reportedly still beating any other algorithm in speed. My custom x86 math library code is for standard MSVC++ 2005 and forward. You need separate float/double versions if you want more precision, which I covered. Sometimes the compiler's "__inline" strategy goes bad, so to be safe, you can remove it. With experience, you can switch to macros to totally avoid a function call each time.

extern __inline float  __fastcall fs_sin(float x);
extern __inline double __fastcall fs_Sin(double x);
extern __inline float  __fastcall fs_cos(float x);
extern __inline double __fastcall fs_Cos(double x);
extern __inline float  __fastcall fs_atan(float x);
extern __inline double __fastcall fs_Atan(double x);
extern __inline float  __fastcall fs_sqrt(float x);
extern __inline double __fastcall fs_Sqrt(double x);
extern __inline float  __fastcall fs_log(float x);
extern __inline double __fastcall fs_Log(double x);

extern __inline float __fastcall fs_sqrt(float x) { __asm {
FLD x  ;// Load/Push input value
FSQRT
}}

extern __inline double __fastcall fs_Sqrt(double x) { __asm {
FLD x  ;// Load/Push input value
FSQRT
}}

extern __inline float __fastcall fs_sin(float x) { __asm {
FLD x  ;// Load/Push input value
FSIN
}}

extern __inline double __fastcall fs_Sin(double x) { __asm {
FLD x  ;// Load/Push input value
FSIN
}}    

extern __inline float __fastcall fs_cos(float x) { __asm {
FLD x  ;// Load/Push input value
FCOS
}}

extern __inline double __fastcall fs_Cos(double x) { __asm {
FLD x  ;// Load/Push input value
FCOS
}}

extern __inline float __fastcall fs_tan(float x) { __asm {
FLD x  ;// Load/Push input value
FPTAN
FSTP ST(0) ;// FPTAN pushes an extra 1.0; pop it so the tangent is left in ST(0)
}}

extern __inline double __fastcall fs_Tan(double x) { __asm {
FLD x  ;// Load/Push input value
FPTAN
FSTP ST(0) ;// FPTAN pushes an extra 1.0; pop it so the tangent is left in ST(0)
}}

extern __inline float __fastcall fs_log(float x) { __asm {
FLDLN2
FLD x
FYL2X
FSTP ST(1) ;// Pop1, Pop2 occurs on return
}}

extern __inline double __fastcall fs_Log(double x) { __asm {
FLDLN2
FLD x
FYL2X
FSTP ST(1) ;// Pop1, Pop2 occurs on return
}}
John Doe
  • 1
    this is far slower than SSE. x87 has pretty much deprecated and you should use SSE anyway. In fact if you want to calculate sin/cos/sqrt for a lot of values then the SSE functions in Intel library can do that for multiple values at once and is significantly faster than any of the above x87 instructions – phuclv Oct 15 '19 at 02:39
  • 1
    since the question doesn't need much precision, an approximate square root using [`rsqrt(x) * x`](https://stackoverflow.com/q/1528727/995714) is even faster. See also [Peter Cordes' answer above](https://stackoverflow.com/a/36660558/995714) – phuclv Oct 15 '19 at 02:48
0

Here's a possible speedup which depends highly on your application. (Your program may not be able to use this at all, but I post it here because it might.) I'm also just posting the math here, the code is up to you.

For my application, I needed to calculate the sines and cosines for every angle step (dA) around a full circle.

This made it so I could take advantage of some trig identities:

cos(-A) = cos(A)

sin(-A) = -sin(A)

This made it so I only needed to calculate sin and cos for one half of the circle.

I also set up pointers to my output arrays, and this also sped up my calculations. I'm not sure of this, but I believe that my compiler vectorized my calculations.

A third trick was to use the angle-addition identities:

sin(A + dA) = sin(A)*cos(dA) + cos(A)*sin(dA)

cos(A + dA) = cos(A)*cos(dA) - sin(A)*sin(dA)

That made it so I only needed to actually calculate the sin and cos of one angle - the rest were calculated with two multiplies and an addition each (see the sketch below). (This comes with the caveat that the roundoff errors in the calculations of sin(dA) and cos(dA) can accumulate by the time you get halfway around the circle. Once again, your application is everything if you use this.)
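
A minimal sketch of that third trick, assuming you want the values tabulated in order around the circle (re-seed from std::sin/std::cos occasionally if error growth matters):

#include <cmath>
#include <vector>

void tabulate_sincos(double dA, int steps, std::vector<double>& sins, std::vector<double>& coss)
{
    const double sd = std::sin(dA), cd = std::cos(dA);  // the only real sin/cos calls
    double s = 0.0, c = 1.0;                            // sin(0), cos(0)
    sins.resize(steps);
    coss.resize(steps);
    for (int i = 0; i < steps; ++i) {
        sins[i] = s;
        coss[i] = c;
        const double sNext = s * cd + c * sd;           // sin(A + dA)
        const double cNext = c * cd - s * sd;           // cos(A + dA)
        s = sNext;
        c = cNext;
    }
}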

-1

Over 100000000 tests, milianw's answer is 2 times slower than the std::cos implementation. However, you can manage to run it faster with the following steps:

->use float

->don't use floor but static_cast

->don't use abs but ternary conditional

->use #define constant for division

->use macro to avoid function call

// 1 / (2 * PI)
#define FPII 0.159154943091895
//PI / 2
#define PI2 1.570796326794896619

#define _cos(x)         x *= FPII;\
                        x -= .25f + static_cast<int>(x + .25f) - 1;\
                        x *= 16.f * ((x >= 0 ? x : -x) - .5f);
#define _sin(x)         x -= PI2; _cos(x);

Over 100000000 calls to std::cos and _cos(x), std::cos runs in ~14s vs ~3s for _cos(x) (a little bit more for _sin(x)).

Hugo Zevetel
  • all of your comments make me wonder whether you actually compiled with compiler optimizations enabled. Most notably, this function should get inlined, thus using a macro or not should not make any difference whatsoever. – milianw Nov 06 '17 at 11:40
  • "Calling an inline function may or may not generate a function call, which typically incurs a very small amount of overhead. The exact situations under which an inline function actually gets inlined vary depending on the compiler; most make a good-faith effort to inline small functions (at least when optimization is enabled), but there is no requirement that they do so (C99, §6.7.4):" (https://stackoverflow.com/questions/5226803/inline-function-v-macro-in-c-whats-the-overhead-memory-speed) – Hugo Zevetel Nov 13 '17 at 10:53
  • Right, it's compiler dependent, so what compiler do you use? Look at the assembly, it's not calling any function on clang or gcc: https://godbolt.org/g/UjAKBh I'd go as far as claiming that you can report a bug to your compiler if it's not inlining this function. Similar the compiler should do the constant devision for you, no need to obfuscate the code with a define with the constant... – milianw Nov 14 '17 at 11:18
  • Code was compiled with Visual Studio 2015. I made the test 1.5 years ago, and I can't remember if it was with optimization enabled or not (I agree I should have written this when answering). However, all the conditions I enumerated were kept because they lowered the execution time. What's more, it seems to me that disabled optimization is a more restrictive condition than enabled optimization, so it means the previous code works in more cases (if optimization was disabled). – Hugo Zevetel Nov 16 '17 at 08:46
  • Concerning Microsoft's behavior with inline functions, Microsoft says: "the compiler does not inline a function if its address is taken or if it is too large to inline." (https://msdn.microsoft.com/en-us/library/cx3b23a3.aspx) which is really not clear. We can also have a look at the keyword __forceinline – Hugo Zevetel Nov 16 '17 at 09:11