Faster sin() for x64

Question

Main question

Does somebody have a fast sin() implementation for x64? It doesn't need to be pure pascal.

Explanation

I've got a VCL application that in some situations runs a lot slower when it's compiled for x64.

It does a lot of floating point 3d calculations, and I've tracked this down to the fact that System.Sin() and System.Cos() are a lot slower on x64, when the input values become large.

I've timed it by creating a simple test application that measures how long it takes to calculate sin(x), with different values for x, and the differences are HUGE:

              call:     x64:     x86:
              Sin(1)   16 ms    20 ms
             Sin(10)   30 ms    20 ms
            Sin(100)   32 ms    20 ms
           Sin(1000)   34 ms    21 ms
          Sin(10000)   30 ms    21 ms
         Sin(100000)   30 ms    16 ms
        Sin(1000000)   35 ms    20 ms
       Sin(10000000)  581 ms    20 ms
      Sin(100000000) 1026 ms    21 ms
     Sin(1000000000) 1187 ms    22 ms
    Sin(10000000000) 1320 ms    21 ms
   Sin(100000000000) 1456 ms    20 ms
  Sin(1000000000000) 1581 ms    17 ms
 Sin(10000000000000) 1717 ms    22 ms
Sin(100000000000000) 1846 ms    23 ms
           Sin(1E15) 1981 ms    21 ms
           Sin(1E16) 2100 ms    21 ms
           Sin(1E17) 2240 ms    22 ms
           Sin(1E18) 2372 ms    18 ms
                etc    etc      etc

What you see here is that sin(1E5) runs about 300 times as fast as sin(1E8).

In case you're interested, I've created the above table like this:

{$APPTYPE CONSOLE}
program SinTest;

uses Diagnostics, Math, SysUtils;

var
  i : Integer;
  x : double;
  sw: TStopwatch;

begin
  x := 1;

  while X < 1E18 do
  begin
    sw    := TStopwatch.StartNew;
    for i := 1 to 500000 do
      System.Sin(x);

    // WriteLn(System.sin(x), #9,System.Sin(fmod(x,2*pi)));

    sw.Stop;

    WriteLn('    ', ('Sin(' + round(x).ToString + ')'):20, ' ', sw.ElapsedMilliseconds,' ms');

    x := x * 10;
  end;

  WriteLn('Press any key to continue');
  readln;
end.

Notes:

There are some questions on StackOverflow regarding faster sine functions, but none of them have source code that is useful to port to Delphi, like this one: Fastest implementation of sine, cosine and square root in C++ (doesn't need to be much accurate)
The rest of the x64 runs faster than it's 32-bit counterpart
I've found a bit of crappy workaround, by doing this: Sin(FMod(x,2*pi)). It provides the correct results, and it runs fast for larger numbers. For smaller numbers it's a bit slower of course.

Presumably you don't care about accuracy or you wouldn't be calling trig functions with such large values. Surely you appreciate that roundoff means that trig functions are meaningless for such input values? Or is accuracy just not important to you? — David Heffernan, Apr 04 '16 at 20:17
So, see if you can guess the output of this program: `{$APPTYPE CONSOLE} var s1, s2: Single; begin s1 := 10000000.5; s2 := 10000000.0; Writeln(s1=s2); end. ` Here's a clue. The output is not `FALSE`. — David Heffernan, Apr 04 '16 at 20:34
It seems that MSVC can do it faster, and I'd be interested to know how because I bet it does it faster for input values that are sensible too. But for your large input values you are wasting your time even calling these trig functions, as my previous comment demonstrates. — David Heffernan, Apr 04 '16 at 20:38
It's a periodic function, so just divide the angle by 2*pi and pass the remainder to the sin function. Don't pass silly numbers to it. You can be sure the x86 function is doing that. — Mike Dunlavey, Apr 04 '16 at 21:14
@Mike That won't work any better. Once the input gets large you start with the roundoff — David Heffernan, Apr 04 '16 at 21:45
@DavidHeffernan: I understand how floating points work, that's not the problem. 100000000 is not really a meaningless number to work with though. Simple example: imagine you want to generate a sine tone for a 5 minute audio sample like this: `sin(2*pi*time)`. If you look at the table above, you'll see that the last samples will be rendered about 100 times as slow as the first ones. That's a significant difference. For most people it can be ignored, but in my situation it's a serious slowdown. The 32bits version runs fine, while the 64bits version crawls. — Wouter van Nifterick, Apr 04 '16 at 21:52
I don't think you do understand. If you did you'd be worried that the values returned by your calls to `sin` were meaningless. Because you cannot represent the numbers that you want. Doesn't it bother you that `10000000.5 = 10000000.0` in single precision? Once you fix that defect in your algorithm then the perf will be fine. And really, don't you think I of all people understand the importance of floating point perf. My work is all about that. — David Heffernan, Apr 04 '16 at 21:57
@DavidHeffernan: `10000000.5 != 10000000.0` in my case, because I'm using doubles. Performance drops way before I'm running out of floating point precision. — Wouter van Nifterick, Apr 04 '16 at 22:04
IMO, wrapping the code for `sin(fmod(x, 2 * pi))` into a little function is probably *about* as good as it gets (in fact, that's what they should have done to start with--from the timing, they probably implemented `fmod` by repeated subtraction, which is great if it's close to the right range, but slow and potentially inaccurate if it's drastically out of range). — Jerry Coffin, Apr 04 '16 at 22:06
No, you are using single precision. It's there in the question. Or is the question not the one you meant to ask? — David Heffernan, Apr 04 '16 at 22:07
@DavidHeffernan: ah, i'm sorry, I've copied the example code after I've tried what happened if I used a single, and didn't change it back. I'm using doubles in my real code. Sorry about that confusion. Performance issues are the same for singles and doubles by the way. — Wouter van Nifterick, Apr 04 '16 at 22:15
Sheez, why all the downvotes? What exactly is wrong with this question? — Wouter van Nifterick, Apr 04 '16 at 22:17
If it were me, I'd fix the problem at source. Using the x87 unit is the wrong way, in my view. But hey, what do I know?!! — David Heffernan, Apr 04 '16 at 22:19
Using something like `if (X < 1E6) then System.Sin(x) else System.Sin(FMod(X,2*Pi));` fixes this without using asm. — LU RD, Apr 04 '16 at 22:26
@DavidHeffernan Agreed - hence my use of *"rather strongly discouraged"*. I raised it as a solution mostly for curiosity and completeness. I wouldn't do it that way either. — J..., Apr 04 '16 at 22:26
@DavidHeffernan: I'm taking your every word seriously. I'm here to learn something about how to deal with this, and I'm glad that you're responding. With "at source" you mean that you would circumvent the entire problem by making sure that the input for the sin() function doesn't grow too much? — Wouter van Nifterick, Apr 04 '16 at 22:26
@LURD: maybe I should wrap something like that into a new sin() function and use that instead. — Wouter van Nifterick, Apr 04 '16 at 22:27
Yes, I would endeavour to find a way to avoid such huge inputs to sin and cos. I'd concentrate on correctness before efficiency and expect the latter to follow for free. — David Heffernan, Apr 05 '16 at 05:36
Some useful links: https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/ http://stackoverflow.com/questions/2284860/how-does-c-compute-sin-and-other-math-functions — David Heffernan, Apr 05 '16 at 07:28

J... · Accepted Answer · 2016-04-05T08:47:08.950

3

While this is probably to be rather strongly discouraged in user mode code (and is totally forbidden in kernel mode code), if you do want to keep the legacy x87 behaviour in your x64 code you could write a function like this :

function SinX87(x:double):double;
var
  d : double;
asm
  movsd qword ptr [rbp+8], xmm0
  fld qword ptr [rbp+8]
  fsin
  fstp qword ptr [rbp+8]
  movsd xmm0, qword ptr [rbp+8]
end;

This adds a bit of overhead since you have to pop the value out of the SSE register onto the stack, load it into the x87 unit, peform the calculation, pop the value back onto the stack, and then load it back into XMM0 for the function result. The sin calculation is pretty heavy, though, so this is a relatively small overhead. I would only really do this if you needed to preserve whatever idiosyncracies of the x87's sin implementation.

Other libraries exist that compute sin more efficiently in x64 code than Delphi's purepascal routines. My overwhelming preference here would be to export a good set of C++ routines to a DLL instead. Also, as David said, using trig functions with ridiculously large arguments is not really a sensible thing to do anyway.

edited Apr 05 '16 at 08:47

answered Apr 04 '16 at 21:00

J...

28,957
5
61
128

Cool, the speed is very stable, no matter what input it gets. For values less than pi it's a tiny bit slower; the rest is always faster. The results are slightly different from Delphi's System.Sin(), but it's insignificant for the numbers that I need to work with. The results look good. This is exactly what I needed. Now all I need to do is add some ugly {$ifdef} stuff, and performance under x64 is restored. Thanks! – Wouter van Nifterick Apr 04 '16 at 22:12
@WoutervanNifterick Also, I'm not sure how exceptions would be handled... I'd definitely test it first. Not sure if the x87 control word gets set to anything sensible by default in x64 mode either - I knocked this up quickly but there are caveats to watch out for. – J... Apr 04 '16 at 22:14
Tested it, and indeed it handles things a bit different. For example `SinX87(NaN)` won't raise any exception, like System.Sin() does. So there are differences indeed, but this is a great help. I'll do some additional tests, but so far this looks like it does everything exactly the way I need it. – Wouter van Nifterick Apr 04 '16 at 22:22

Wouter van Nifterick · Answer 2 · 2016-04-04T22:57:40.193

In case you're interested in my final solution:

I've experimented a bit, by doing this (as LU RD and e). – Jerry Coffin suggested):

function sin(x:double):double;
begin
  if x<1E6 then
    Result := system.sin(x)
  else
    Result := system.sin(fmod(x,2*pi));
end;

Maybe it has something to do with the predictability of the test code on my specific CPU, but smaller values were actually calculated faster if I didn't do the if, and just always use fmod(). Strange, because some division needs to take place, which I'd expect to be slowed than comparing two values.

So this is what I end up using now:

function sin(const x: double): double; { inline; }
begin
  {$IFDEF CPUX64}
  Result := System.sin(Math.FMod(x,2*pi));
  {$ELSE}
  Result := System.sin(x);
  {$ENDIF}
end;

By the way adding inline, it ran 1.5 times faster even. It then runs exactly as fast as J...'s function on my machine. But even without Inline this is already hundreds of times faster than System.Sin(), so I'm going for this.

Even if you use `fmod(x, 2*pi)`, as @DavidHeffernan pointed out, you run up against the fact that `x`, as a double precision variable, cannot hold more than about 17 decimal digits of information, so you lose all your precision of what is passed into the `sin` function. Ex: If you are stepping `x` from 100000000000000000.0 to 100000000000000000.1, representing a .1-radian step, those two numbers are the same, because when the .1 is added, it is lost because the double precision variable is not wide enough to hold the whole thing. You have to find another way to encode `x`. — Mike Dunlavey, Apr 05 '16 at 00:29

Faster sin() for x64

2 Answers2