This question is addressed to users experienced with the SSE/AVX instruction families, and especially to those familiar with analysing their performance. I have seen many different implementations and approaches, ranging from older SSE2 ones to newer ones; the web is flooded with such links. Personally, though, I am not deeply experienced in analysing SSE assembly. Some people point to uops and caches, and that requires low-level knowledge. So I am asking for hints and your personal experience: if you have time, please share a comparison of what is fastest and why, and which approaches you looked at. The implementation need not be very precise; 10-16 bits of single-precision FP accuracy is good enough. More is better, but only if it does not affect speed.
PS. To avoid a flood of meta comments, here is the task described precisely:
- Given a scalar argument x (in radians) that is passed in an xmm register (according to the x64 fastcall convention).
- Write a function with signature
__m128 sincos(float x)
that returns approximations of sin(x) and cos(x).
- The return value should fit in a single xmm register and should be computed as fast as possible while still satisfying the 10-bit precision requirement.
- The argument may be any real number (but not nan, inf, and so on). If the approach requires argument normalisation, a performant implementation of that step (the fmod() part) is also part of the question. The question is not about handling special FP cases, though.
This may be a duplicate, but I have failed to find a similar question here, so please point me to it if one already exists.