
This question is addressed to users experienced with the SSE/AVX instruction families, in particular those familiar with their performance analysis. I have seen many different implementations and approaches, ranging from older SSE2 code to newer ones; the web is flooded with such links. But I am not deeply experienced in analyzing SSE assembly myself. Some people point to uops, caches and other details that require low-level knowledge. So I am asking for hints and your personal experience: if you have time, a comparison of which approach is fastest and why, and which approaches you have looked at, would be welcome. The implementation does not need to be very precise; 10-16 bits of single-precision FP accuracy is good enough. More is better, but only if it does not affect speed.

PS. To avoid a meta discussion, here is the task described precisely:

  • Given a scalar argument x (in radians), passed in an xmm register (according to the x64 fastcall convention).
  • Write a function with the signature __m128 sincos(float x); that returns approximations of sin(x) and cos(x) (a naive reference for this contract is sketched after the list).
  • The return value should fit in one xmm register and be calculated as fast as possible while satisfying the 10-bit precision requirement.
  • The argument can be any real number (but not NaN, Inf and so on). If an approach requires argument range reduction, a performant implementation of it (fmod()) is also of interest. But the question is not about handling special FP cases.
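
For illustration, a naive reference for the requested contract (not a fast implementation; the name `sincos_ref` is mine) could look like this, just packing the library results into one xmm register:

```c
#include <immintrin.h>
#include <math.h>

// Naive reference for the requested contract: returns {sin(x), cos(x), 0, 0}
// packed into a single xmm register. Illustration only, not the fast version.
__m128 sincos_ref(float x)
{
    return _mm_setr_ps(sinf(x), cosf(x), 0.0f, 0.0f);
}
```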

This may be a duplicate, but I have failed to find a similar question here, so please point me to one if it already exists.

  • That function signature can't be optimal; you need to return 2 separate vectors for one input vector. With only one output vector, you would need to shuffle, and would only have room for 2 pairs of sin / cos results from 4 `float` elements. And most code that used the results would have to shuffle them apart. In asm, obviously just return in `xmm0` and `xmm1`. With C intrinsics, either return a struct of 2 `__m128`, or (probably better) take an output arg by reference. – Peter Cordes Feb 25 '18 at 08:52
  • @Peter Cordes The use case is 2D games, where it is often useful to have a unit vector (direction vector) constructed from an angle, to pass on to geometry algorithms such as movement integration, collision detection, ray casting and so on. On the other hand, the developer can remove the shuffle instruction if he really does not need it. – xakepp35 Feb 25 '18 at 08:58
  • In that use case, you definitely want the x and y coordinates for each unit vector in separate SIMD vectors, because you'll be using them with vertical operations. Putting multiple components of 2D or 3D vectors inside a single SIMD vector is a classic example of doing it wrong. See https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ for a nice detailed intro-to-SIMD which makes this point clearly and well. (See also other links in https://stackoverflow.com/tags/sse/info) – Peter Cordes Feb 25 '18 at 09:02
  • @Peter Cordes Okay, I would agree that dot product and cross product could be faster, if 2 muls and an add/sub are faster than one mul and an hadd/hsub. Are they? – xakepp35 Feb 25 '18 at 09:20
  • `hadd` sucks on all CPUs; it costs 2 shuffles + 1 vertical add. You also didn't mention `fma`. The best way to use SIMD is to calculate 4 separate angles in parallel (with a `__m128` input). I didn't notice at first that you only wanted a scalar `float` as the input to this function. If you really can't find a way to do multiple separate points at once (4 separate cross products in a pair of SIMD vectors), then sometimes you can get a speedup from shuffling things together to feed some `add` or something. But a smaller speedup than if you could redesign your data layout for SIMD. – Peter Cordes Feb 25 '18 at 09:41
  • @Peter Cordes However, if you store x and y in different __m128 variables, you add 2x more vertical payload and use 2x the RAM, so performance starts to slow down on simple vector operations like position scaling or addition. I tried this; I have a complex task (neural-network-driven racing cars): single-core performance stays about the same, just a bit lower, but multicore performance almost halves. So it's better to do some shuffling only at the beginning of the collision detection algorithm (which involves 3 cross products). – xakepp35 Feb 25 '18 at 10:06
  • @Peter Cordes Finally, I would end up with vec2f and hvec2f classes, where the first is for storing and manipulating positions and velocities, and the second is just for performing horizontal operations, with dot(), cross() methods and so on, plus a shuffling conversion to construct the horizontal representation from the vertical one. – xakepp35 Feb 25 '18 at 10:17
  • Huh, why are you storing `__m128` in memory? Are you talking about storing a whole `__m128` where only the low element has an `x` value for a single point, and the rest are all zero? That completely defeats the purpose of SIMD, you might as well just write code using scalar `float`. I'm talking about having an `__m128` with *4* different `x` values, and another `__m128` with the 4 corresponding `y` values. So you don't have any wasted memory. – Peter Cordes Feb 25 '18 at 18:32
  • Or if you want to store only 2 values (an `x` and `y` pair), it doesn't make sense to store a whole `__m128` in memory. Use the intrinsics for `movsd` loads / `movlps` stores to load/store from `struct { float x,y; };` into `__m128` local variables for computation, but don't have an array of `__m128` with 2 elements unused in each SIMD vector. – Peter Cordes Feb 25 '18 at 18:35
  • @Peter Cordes I am talking about vertical vectors in my question, for applications like `newPos = curPos * 2 - prevPos + accelVec * dTsquared`, where accelVec is composed of a unit vector scaled by several things. So I need to store cos and sin in the same xmm register to get an advantage. Imagine an n-body gravity problem, for example, where spaceships are controlled with throttle. – xakepp35 Feb 25 '18 at 19:51
  • Pure vertical SIMD would be computing 4 different `newPos` x values in parallel, and 4 corresponding `newPos` y values in parallel, and storing your positions as two separate `x[]` and `y[]` arrays, rather than a single array of `struct xy []`. For calculations where the x and y components don't interact, sure you can keep two xy pairs in a SIMD vector, but any time you need to use the x and y of a single object in one calculation (e.g. a distance between points), you need to shuffle instead of just doing vertical math ops and getting a SIMD vector of 4 distances. (A minimal struct-of-arrays sketch follows these comments.) – Peter Cordes Feb 25 '18 at 20:13
  • Seriously, go read https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/, and think about Array of Structs vs. Struct of Arrays. – Peter Cordes Feb 25 '18 at 20:13
  • @Peter Cordes Yes, now I get the point and see the difference after several days of research and development. I ended up refactoring almost all the code to avoid branching and to perform computations in parallel. That revealed some architectural weaknesses and required additional storage for intermediate results, which was not obvious in the previous serial code. Now it is more like a pipeline conveyor, passing separate data arrays from stage to stage. Great performance! – xakepp35 Feb 26 '18 at 12:08
  • With a good system (say gcc and a recent glibc), you don't need to do anything special (maybe pass some flag like -ffast-math). The compiler will notice if you use both sin and cos and compute them together with sincos, and if it is used in a loop, it will auto-vectorize and call a vectorized version of sincos (from libmvec if you use glibc). – Marc Glisse Feb 26 '18 at 12:16
  • Does this question belong on [code golf](https://codegolf.stackexchange.com/)? – gman Feb 26 '18 at 12:23
  • @Marc Glisse I thought so, but no, MSVC does not, and neither does the Intel C++ 18 compiler. A simple for loop of 4096 iterations, given an angles array and required to produce a sines array and a cosines array, all aligned on 32-byte boundaries in memory: both compilers just emit calls to sinf() and cosf() separately, producing awful performance even for this simple synthetic test. I don't want to rely on the compiler. – xakepp35 Feb 26 '18 at 14:28
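
To make the struct-of-arrays layout from the comments concrete, here is a minimal sketch (array and function names are illustrative, not from any library discussed): positions are kept in separate x[] and y[] arrays, so the `newPos = curPos * 2 - prevPos + accelVec * dTsquared` update runs on four objects per iteration with purely vertical operations and no shuffles.

```c
#include <immintrin.h>
#include <stddef.h>

// Struct-of-arrays position update: four objects per iteration, vertical SSE
// math only. Assumes 16-byte-aligned arrays and n being a multiple of 4.
void integrate_positions(float *newX, float *newY,
                         const float *curX, const float *curY,
                         const float *prevX, const float *prevY,
                         const float *accX, const float *accY,
                         float dtSquared, size_t n)
{
    const __m128 two = _mm_set1_ps(2.0f);
    const __m128 dt2 = _mm_set1_ps(dtSquared);
    for (size_t i = 0; i < n; i += 4) {
        // newPos = curPos * 2 - prevPos + accel * dt^2, lane-wise for x and y
        __m128 nx = _mm_add_ps(_mm_sub_ps(_mm_mul_ps(_mm_load_ps(curX + i), two),
                                          _mm_load_ps(prevX + i)),
                               _mm_mul_ps(_mm_load_ps(accX + i), dt2));
        __m128 ny = _mm_add_ps(_mm_sub_ps(_mm_mul_ps(_mm_load_ps(curY + i), two),
                                          _mm_load_ps(prevY + i)),
                               _mm_mul_ps(_mm_load_ps(accY + i), dt2));
        _mm_store_ps(newX + i, nx);
        _mm_store_ps(newY + i, ny);
    }
}
```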

1 Answer


I have discovered a great modern revision of Julien Pommier's implementation, ported to AVX/AVX2 under the zlib license, thanks to Giovanni Garberoglio:

http://software-lisc.fbk.eu/avx_mathfun/

It works really fast: 80-90M iterations per second on a single core of an i7 3770K, producing 8 sines and 8 cosines per iteration, compared to ~15M iterations per second if I call sinf() and cosf() 8 times each per iteration (the functions from the MSVC 2017 x64 library, compiled with AVX settings).
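
The hot loop of my test was roughly of the following shape (a sketch assuming the `sincos256_ps(x, &s, &c)` entry point from that header, with all arrays 32-byte aligned and the count a multiple of 8):

```c
#include <immintrin.h>
#include "avx_mathfun.h"  // Giovanni Garberoglio's AVX port of Julien Pommier's routines

// Produce sines[] and cosines[] from angles[], 8 values per loop iteration.
void sincos_array(const float *angles, float *sines, float *cosines, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 s, c;
        sincos256_ps(_mm256_load_ps(angles + i), &s, &c);
        _mm256_store_ps(sines + i, s);
        _mm256_store_ps(cosines + i, c);
    }
}
```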


UPD: There is also the excellent FastTrigo code sample, whose FT::sincos() function is 20% faster than Julien Pommier's implementation, and that FT::sincos() provides exactly 10 bits of guaranteed accuracy.
