10

I'm looking for an efficient (fast) approximation of the exponential function operating on AVX elements (single-precision floating point). Namely, __m256 _mm256_exp_ps( __m256 x ), without SVML.

Relative accuracy should be something like ~1e-6, or ~20 mantissa bits (1 part in 2^20).

I'd be happy if it is written in C style with Intel intrinsics.
The code should be portable (Windows, macOS, Linux, MSVC, ICC, GCC, etc.).


This is similar to Fastest Implementation of Exponential Function Using SSE, but that question is looking for something very fast with low precision (the current answer there gives about 1e-3 precision).

Also, this question is looking for AVX / AVX2 (and FMA). But note that the answers on both questions are easily ported between SSE4 __m128 and AVX2 __m256, so future readers can choose based on the required precision / performance trade-off.

Royi
  • vml should be ok: https://bitbucket.org/eschnett/vecmathlib/wiki/Home – Regis Portalez Feb 19 '18 at 10:42
  • It is only for GCC which I don't use. – Royi Feb 19 '18 at 10:56
  • Have a look at the AVX2 optimized exp function from [avx_mathfun](https://github.com/reyoung/avx_mathfun). – wim Feb 19 '18 at 11:37
  • it's pure C++ code which works well with clang. What compiler do you use? – Regis Portalez Feb 19 '18 at 12:25
  • Mostly MSVC on Windows. But I need to be able to work on others as well. – Royi Feb 19 '18 at 13:00
  • @wim, I tried AVX MathFun you linked. It doesn't work on Visual Studio 2017. – Royi Feb 19 '18 at 13:00
  • @Royi: so port the AVX implementation of the function you need to plain intrinsics. The algorithm and choice of instructions is the important part. I haven't looked, but I assume that's not buried under too many layers of wrapper syntax. – Peter Cordes Feb 19 '18 at 15:05
  • Since you are using Agner Fog's VCL why not just use his implementation which works for SSE2 through AVX512 with GCC, Clang, ICC, and MSVC? See the file `vectormath_exp.h`. – Z boson Feb 20 '18 at 10:21
  • @Zboson, Because in VCL you can't impose the instructions used. For example, in the case where I have a `MyFunSSE()` and a `MyFunAVX()` in the same project, I can't make VCL use only SSE when called from `MyFunSSE()` and allow AVX when called from `MyFunAVX()`. – Royi Feb 20 '18 at 10:33
  • @Royi why can't you move your SSE and AVX functions to separate source files and compile one with `-msse2` and the other with `-mavx`? – Z boson Feb 20 '18 at 10:52
  • And in any case, the vectormath_exp.h file shows you how to vectorize exp if you want to write your own version. – Z boson Feb 20 '18 at 11:39
  • @Zboson, I wasn't aware I could use per-file options. This is a really good option. Thank you for notifying me of this +1. – Royi Feb 20 '18 at 12:23
  • You should go through the file `dispatch_example.cpp`. It's short and well documented. – Z boson Feb 20 '18 at 12:41
  • @Zboson Note that Agner Fog claims that the performance of `exp` in vectormath_exp.h is poor, see [page 43](http://www.agner.org/optimize/vectorclass.pdf) of the document. An advantage of `avx_mathfun` is that it uses a Chebyshev approximation-like polynomial instead of a Taylor expansion, which is used by VCL. Therefore `avx_mathfun` should have a better balance between performance and precision than VCL. – wim Feb 20 '18 at 12:43
  • @wim, thanks for the info. That's good to know. GCC (with `-Ofast -fopenmp` and `#pragma omp simd`) will vectorize `exp` like SVML. That's a solution for GCC and ICC but not Clang. I gave up on MSVC a long time ago. – Z boson Feb 20 '18 at 12:49
  • @wim, apparently GCC only vectorizes `exp` for `double` and not `float`. Strange: https://godbolt.org/g/mN14F7 – Z boson Feb 20 '18 at 13:02
  • @Zboson That's strange indeed. But at least manually vectorizing is a more interesting occupation than finding the right compiler hints and options for auto-vectorization. – wim Feb 20 '18 at 14:21
  • @wim the example I showed there uses the vector extensions. So it's really explicit vectorization and not auto-vectorization. The reason I like vector extensions is they make nice code and work on other hardware (e.g. ARM). As long as you are using vertical operations (and not horizontal operations such as shuffles) they usually work well. The vectorized math functions will improve. Clang needs to add vectorized math functions. – Z boson Feb 20 '18 at 14:38
  • @Zboson Now I see. I read about the GCC vector extensions before, but I haven't used them myself. Indeed it is an interesting choice for cross-platform explicit vector programming. – wim Feb 21 '18 at 09:49
  • @Zboson, Do you mean those: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html? – Royi Feb 21 '18 at 10:25
  • @Royi, yes, many of my answers recently have been about these, if you want to read up. I started with [this answer](https://stackoverflow.com/questions/48069990/multithreaded-simd-vectorized-mandelbrot-in-r-using-rcpp-openmp/48283672#48283672) recently and there are six answers after that about vector extensions. I actually made a wrapper around them which replaces much of what the VCL does. One major advantage is that my code compiles on ARM as well. It lacks MSVC support (which I don't care about) and Intel's vector extensions support can be disappointing, but otherwise it's been a success. – Z boson Feb 21 '18 at 11:46
  • It turns out you don't need to use `omp simd` anymore for vectorized math with GCC, starting with glibc 2.23. That's good to know. https://sourceware.org/glibc/wiki/libmvec https://godbolt.org/g/EhoJLV – Z boson Feb 21 '18 at 11:52
  • @Zboson, I don't understand, the first code with omp simd will also be multithreaded, won't it? – Royi Feb 21 '18 at 11:55
  • @Royi, no, because it does not use `parallel` it does not use a team of threads. – Z boson Feb 21 '18 at 11:58
  • It seems at least with vector extensions `omp simd` is still necessary often. – Z boson Feb 21 '18 at 12:05
  • I think ICC 18 vectorizes code like that with vectorized math to begin with. – Royi Feb 21 '18 at 12:08
  • The questions aren't exact duplicates, but njuffa's answer is probably still accurate to 1e-6, and the SSE4 version of that answer can be trivially extended to `__m256` / `__m256i` for an AVX2 version. Maybe if you needed an AVX1-only version you'd have a substantially different question (because integer ops of the same vector width wouldn't be available). I debated just leaving a comment or edit, but decided to close as duplicate because njuffa's answer is very good (he knows what he's doing with FP math and SIMD.) The answer here is maybe useful too if it has higher accuracy. – Peter Cordes Feb 21 '18 at 23:10
  • @PeterCordes, I checked njuffa's answer. It has much lower accuracy (Checked vs. MATLAB). – Royi Feb 21 '18 at 23:49
  • Yes, but (an AVX2 version of it) is faster, right? Is it accurate *enough*? – Peter Cordes Feb 22 '18 at 03:21
  • Why would it have higher accuracy? Its accuracy vs MATLAB is ~1e-3. Not that good. – Royi Feb 22 '18 at 11:11

4 Answers

7

The exp function from avx_mathfun uses range reduction in combination with a Chebyshev approximation-like polynomial to compute 8 exponentials in parallel with AVX instructions. Use the right compiler settings to make sure that addps and mulps are fused to FMA instructions, where possible.
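
For example (an illustrative guess, not from the original answer; the exact flags, and whether separate mul/add intrinsics actually get contracted into FMA, depend on the compiler and its fp-contract setting):

/*  gcc   -O3 -mavx2 -mfma    expc.c
    or target a CPU that has both AVX2 and FMA:
    gcc   -O3 -march=haswell  expc.c        */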

It is quite straightforward to adapt the original exp code from avx_mathfun to portable (across different compilers) C / AVX2 intrinsics code. The original code uses gcc-style alignment attributes and ingenious macros. The modified code, which uses the standard _mm256_set1_ps() instead, follows the small test code and its output below. The modified code requires AVX2.

The following code is used for a simple test:

#include <stdio.h>
#include <immintrin.h>

__m256 exp256_ps(__m256 x);   /* defined further below */

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);
    __m256 y = exp256_ps(x);
    _mm256_storeu_ps(xv,x);   /* unaligned store: xv is not guaranteed to be 32-byte aligned */
    _mm256_storeu_ps(yv,y);

    for (i=0;i<8;i++){
        printf("i = %i, x = %e, y = %e \n",i,xv[i],yv[i]);
    }
    return 0;
}

The output seems to be ok:

i = 0, x = 1.000000e+00, y = 2.718282e+00 
i = 1, x = 2.000000e+00, y = 7.389056e+00 
i = 2, x = 3.000000e+00, y = 2.008554e+01 
i = 3, x = 4.000000e+00, y = 5.459815e+01 
i = 4, x = 5.000000e+00, y = 1.484132e+02 
i = 5, x = 6.000000e+00, y = 4.034288e+02 
i = 6, x = 7.000000e+00, y = 1.096633e+03 
i = 7, x = 8.000000e+00, y = 2.980958e+03 

The modified code (AVX2) is:

#include <stdio.h>
#include <immintrin.h>
/*     gcc -O3 -m64 -Wall -mavx2 -march=broadwell  expc.c    */

__m256 exp256_ps(__m256 x) {
/* Modified code. The original code is here: https://github.com/reyoung/avx_mathfun

   AVX implementation of exp
   Based on "sse_mathfun.h", by Julien Pommier
   http://gruntthepeon.free.fr/ssemath/
   Copyright (C) 2012 Giovanni Garberoglio
   Interdisciplinary Laboratory for Computational Science (LISC)
   Fondazione Bruno Kessler and University of Trento
   via Sommarive, 18
   I-38123 Trento (Italy)
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.
  (this is the zlib license)
*/
/* 
  To increase compatibility across different compilers, the original code is
  converted to plain AVX2 intrinsics code without ingenious macros,
  gcc-style alignment attributes, etc. The modified code requires AVX2.
*/
__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256   cephes_exp_C1 = _mm256_set1_ps(0.693359375);
__m256   cephes_exp_C2 = _mm256_set1_ps(-2.12194440e-4);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   tmp           = _mm256_setzero_ps(), fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx    = _mm256_mul_ps(x, cephes_LOG2EF);
        fx    = _mm256_add_ps(fx, _mm256_set1_ps(0.5f));
        tmp   = _mm256_floor_ps(fx);
__m256  mask  = _mm256_cmp_ps(tmp, fx, _CMP_GT_OS);    
        mask  = _mm256_and_ps(mask, one);
        fx    = _mm256_sub_ps(tmp, mask);
        tmp   = _mm256_mul_ps(fx, cephes_exp_C1);
__m256  z     = _mm256_mul_ps(fx, cephes_exp_C2);
        x     = _mm256_sub_ps(x, tmp);
        x     = _mm256_sub_ps(x, z);
        z     = _mm256_mul_ps(x,x);

__m256  y     = cephes_exp_p0;
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p1);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p2);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p3);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p4);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p5);
        y     = _mm256_mul_ps(y, z);
        y     = _mm256_add_ps(y, x);
        y     = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0  = _mm256_cvttps_epi32(fx);
        imm0  = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0  = _mm256_slli_epi32(imm0, 23);
__m256  pow2n = _mm256_castsi256_ps(imm0);
        y     = _mm256_mul_ps(y, pow2n);
        return y;
}

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);
    __m256 y = exp256_ps(x);
    _mm256_storeu_ps(xv,x);   /* unaligned store: xv is not guaranteed to be 32-byte aligned */
    _mm256_storeu_ps(yv,y);

    for (i=0;i<8;i++){
        printf("i = %i, x = %e, y = %e \n",i,xv[i],yv[i]);
    }
    return 0;
}


As @Peter Cordes points out, it should be possible to replace the _mm256_floor_ps(fx + 0.5f) sequence by a single _mm256_round_ps(fx). Moreover, the mask = _mm256_cmp_ps(tmp, fx, _CMP_GT_OS); line and the next two lines are redundant. Further optimization is possible by combining cephes_exp_C1 and cephes_exp_C2 into inv_LOG2EF. This leads to the following code, which has not been tested thoroughly!

#include <stdio.h>
#include <immintrin.h>
#include <math.h>
/*    gcc -O3 -m64 -Wall -mavx2 -march=broadwell  expc.c -lm     */

__m256 exp256_ps(__m256 x) {
/* Modified code from this source: https://github.com/reyoung/avx_mathfun

   AVX implementation of exp
   Based on "sse_mathfun.h", by Julien Pommier
   http://gruntthepeon.free.fr/ssemath/
   Copyright (C) 2012 Giovanni Garberoglio
   Interdisciplinary Laboratory for Computational Science (LISC)
   Fondazione Bruno Kessler and University of Trento
   via Sommarive, 18
   I-38123 Trento (Italy)
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.
  (this is the zlib license)

*/
/* 
  To increase compatibility across different compilers, the original code is
  converted to plain AVX2 intrinsics code without ingenious macros,
  gcc-style alignment attributes, etc.
  Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
  This modified code is not thoroughly tested!
*/


__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256   inv_LOG2EF    = _mm256_set1_ps(0.693147180559945f);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx     = _mm256_mul_ps(x, cephes_LOG2EF);
        fx     = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256  z      = _mm256_mul_ps(fx, inv_LOG2EF);
        x      = _mm256_sub_ps(x, z);
        z      = _mm256_mul_ps(x,x);

__m256  y      = cephes_exp_p0;
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p1);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p2);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p3);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p4);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p5);
        y      = _mm256_mul_ps(y, z);
        y      = _mm256_add_ps(y, x);
        y      = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0   = _mm256_cvttps_epi32(fx);
        imm0   = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0   = _mm256_slli_epi32(imm0, 23);
__m256  pow2n  = _mm256_castsi256_ps(imm0);
        y      = _mm256_mul_ps(y, pow2n);
        return y;
}

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(11.0f, -12.0f, 13.0f, -14.0f, 15.0f, -16.0f, 17.0f, -18.0f);
    __m256 y = exp256_ps(x);
    _mm256_storeu_ps(xv,x);   /* unaligned store: xv is not guaranteed to be 32-byte aligned */
    _mm256_storeu_ps(yv,y);

 /* compare exp256_ps with the double precision exp from math.h, 
    print the relative error             */
    printf("i      x                     y = exp256_ps(x)      double precision exp        relative error\n\n");
    for (i=0;i<8;i++){ 
        printf("i = %i  x =%16.9e   y =%16.9e   exp_dbl =%16.9e   rel_err =%16.9e\n",
           i,xv[i],yv[i],exp((double)(xv[i])),
           ((double)(yv[i])-exp((double)(xv[i])))/exp((double)(xv[i])) );
    }
    return 0;
}

The next table gives an impression of the accuracy at certain points, by comparing exp256_ps with the double precision exp from math.h. The relative error is in the last column.

i      x                     y = exp256_ps(x)      double precision exp        relative error

i = 0  x = 1.000000000e+00   y = 2.718281746e+00   exp_dbl = 2.718281828e+00   rel_err =-3.036785947e-08
i = 1  x =-2.000000000e+00   y = 1.353352815e-01   exp_dbl = 1.353352832e-01   rel_err =-1.289636419e-08
i = 2  x = 3.000000000e+00   y = 2.008553696e+01   exp_dbl = 2.008553692e+01   rel_err = 1.672817689e-09
i = 3  x =-4.000000000e+00   y = 1.831563935e-02   exp_dbl = 1.831563889e-02   rel_err = 2.501162103e-08
i = 4  x = 5.000000000e+00   y = 1.484131622e+02   exp_dbl = 1.484131591e+02   rel_err = 2.108215155e-08
i = 5  x =-6.000000000e+00   y = 2.478752285e-03   exp_dbl = 2.478752177e-03   rel_err = 4.380257261e-08
i = 6  x = 7.000000000e+00   y = 1.096633179e+03   exp_dbl = 1.096633158e+03   rel_err = 1.849522682e-08
i = 7  x =-8.000000000e+00   y = 3.354626242e-04   exp_dbl = 3.354626279e-04   rel_err =-1.101575118e-08
wim
  • Why are they using `floor(fx+0.5)` instead of [`_mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC)`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=1825,5214,433,720,4251,4464,4464&text=mm256_round_ps)? Is it really important to round half-way cases always towards -Inf instead of the usual nearest-even? Or to introduce [weirdness for inputs like 1 ulp below 0.5](https://stackoverflow.com/questions/485525/round-for-float-in-c/24348037?noredirect=1#comment23470590_485546). – Peter Cordes Feb 19 '18 at 16:34
  • Maybe whoever wrote this wasn't aware that `vroundps` existed, and was porting scalar code that used `floor`. And then they compare `floor(fx+0.5) > (fx+0.5)` and conditionally subtract `1.0` from the rounded result? Is that some kind of fixup against rounding errors introduced by the crappy `+0.5` in the first place? Actually I don't see how the compare is ever true, isn't `floor( f )` strictly `<= f`? – Peter Cordes Feb 19 '18 at 16:48
  • @PeterCordes Good points! Indeed the original [Cephes code from netlib](http://www.netlib.org/cephes/) uses `floor(fx+0.5)`. Somehow they made a weird intrinsics translation. I don't understand the compare `floor(fx+0.5) > (fx+0.5)` either. Moreover I don't see the point of subtracting both `fx*exp_c1` and `fx*exp_c2` from `x` instead of `fx*exp_c3`, with `exp_c3=exp_c1+exp_c2` at once. – wim Feb 19 '18 at 22:42
  • Maybe the original had something else to subtract, and split the constant into a large and small parts for extra FP precision. e.g. `x -= fx*big_part;` `x -= something_else;` `x -= fx * small_part;`. This is a real thing that gets done, e.g. [in Agner Fog's VCL `log_f` function](https://github.com/darealshinji/vectorclass/blob/master/vectormath_exp.h#L967). Doing it in 2 steps back-to-back doesn't make much sense to me, unless maybe the big part of the constant has a lot of trailing zeros in the mantissa? Does that even matter? Can this somehow mitigate catastrophic cancellation? IDK. – Peter Cordes Feb 19 '18 at 22:56
  • Anyway, the OP wants something fast with relative error up to 1e-6 being acceptable, and that's very far from needing all 23 significand bits correct, so extra-precision tricks should be dropped (if that's what they are). Unless there's a danger zone around cutoffs between exponent steps or something. – Peter Cordes Feb 19 '18 at 23:01
  • @PeterCordes I see. Indeed `0.693359375` has a floating point representation with 15 zero bits at the end. So, multiplying `0.693359375` with a small integer should be accurate, and this big/small split may indeed help a bit to improve the accuracy. – wim Feb 19 '18 at 23:30
  • Then that's probably it. Of course, if FMA is available you have no rounding in the temporary at all. :) – Peter Cordes Feb 19 '18 at 23:32
  • @Royi Thanks. Note that the previous version of the optimized code still contained the line ` fx = _mm256_add_ps(fx, _mm256_set1_ps(0.5f));`, which is/was wrong. The current answer should be correct. – wim Feb 20 '18 at 00:18
  • Someone needs to take care of SSE MathFun and AVX MathFun. Just to make sure they are compatible with current systems. I'd donate to such project. – Royi Feb 20 '18 at 13:34
  • @wim, You might want convert your solution to SSE for this: https://stackoverflow.com/questions/47025373. – Royi Feb 20 '18 at 14:23
  • I am reasonably certain that `_mm256_slli_epi32` requires AVX2. It might be worthwhile to specifically point that out in the answer. – njuffa Feb 20 '18 at 20:34
  • @Royi I'm afraid that would be too much like recycling the same answer. It shouldn't be too difficult to write an SSE version yourself: use search and replace to: 1. replace `__m256i` by `__m128i`, 2. replace `__m256` by `__m128`, 3. replace `_mm256_` by `_mm_`, 4. replace `si256` by `si128`. Or: start with the original [sse_mathfun](http://gruntthepeon.free.fr/ssemath/) code and apply the same changes to that code as I did to the avx_mathfun code. (A sketch of the result is below this comment thread.) – wim Feb 21 '18 at 09:53
  • The `exp256_ps()` code at the bottom of the answer (with cephes_exp_C1 and cephes_exp_C2 combined into inv_LOG2EF) has a maximum relative error of 3.0e-7, for all floats `x` in [-84,84], if FMA is used where possible (compile with FMA enabled). With AVX2, but without FMA the maximum relative error is 4.1e-6, for `x` in [-84,84]. – wim Mar 05 '18 at 13:29
  • @wim, If I write an SSE4 equivalent, what should I put instead of `_mm_cmp_ps(tmp, fx, _CMP_GT_OS)`? – Royi Apr 06 '18 at 15:04
  • @Royi You can use `_mm_cmpgt_ps` instead. – wim Apr 08 '18 at 11:06
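
Applying that search-and-replace recipe to the simplified exp256_ps() above gives roughly the following SSE4.1 version. This is a mechanical, untested sketch, not code from the original answer; it assumes #include <immintrin.h> and SSE4.1 (required for _mm_round_ps):

__m128 exp128_ps(__m128 x) {
/* Untested sketch: SSE4.1 port of the simplified exp256_ps() above,
   obtained with the search-and-replace recipe from the comments.   */
__m128   exp_hi        = _mm_set1_ps(88.3762626647949f);
__m128   exp_lo        = _mm_set1_ps(-88.3762626647949f);

__m128   cephes_LOG2EF = _mm_set1_ps(1.44269504088896341f);
__m128   inv_LOG2EF    = _mm_set1_ps(0.693147180559945f);

__m128   cephes_exp_p0 = _mm_set1_ps(1.9875691500E-4);
__m128   cephes_exp_p1 = _mm_set1_ps(1.3981999507E-3);
__m128   cephes_exp_p2 = _mm_set1_ps(8.3334519073E-3);
__m128   cephes_exp_p3 = _mm_set1_ps(4.1665795894E-2);
__m128   cephes_exp_p4 = _mm_set1_ps(1.6666665459E-1);
__m128   cephes_exp_p5 = _mm_set1_ps(5.0000001201E-1);
__m128   one           = _mm_set1_ps(1.0f);
__m128   fx;
__m128i  imm0;

        x     = _mm_min_ps(x, exp_hi);
        x     = _mm_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx    = _mm_mul_ps(x, cephes_LOG2EF);
        fx    = _mm_round_ps(fx, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
__m128  z     = _mm_mul_ps(fx, inv_LOG2EF);
        x     = _mm_sub_ps(x, z);
        z     = _mm_mul_ps(x,x);

__m128  y     = cephes_exp_p0;
        y     = _mm_mul_ps(y, x);
        y     = _mm_add_ps(y, cephes_exp_p1);
        y     = _mm_mul_ps(y, x);
        y     = _mm_add_ps(y, cephes_exp_p2);
        y     = _mm_mul_ps(y, x);
        y     = _mm_add_ps(y, cephes_exp_p3);
        y     = _mm_mul_ps(y, x);
        y     = _mm_add_ps(y, cephes_exp_p4);
        y     = _mm_mul_ps(y, x);
        y     = _mm_add_ps(y, cephes_exp_p5);
        y     = _mm_mul_ps(y, z);
        y     = _mm_add_ps(y, x);
        y     = _mm_add_ps(y, one);

  /* build 2^n */
        imm0  = _mm_cvttps_epi32(fx);
        imm0  = _mm_add_epi32(imm0, _mm_set1_epi32(0x7f));
        imm0  = _mm_slli_epi32(imm0, 23);
__m128  pow2n = _mm_castsi128_ps(imm0);
        y     = _mm_mul_ps(y, pow2n);
        return y;
}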
7

Since fast computation of exp() requires manipulation of the exponent field of IEEE-754 floating-point operands, AVX is not really suitable for this computation, as it lacks 256-bit integer operations. I will therefore focus on AVX2. Support for fused multiply-add is technically a feature separate from AVX2, therefore I provide two code paths, with and without use of FMA, controlled by the macro USE_FMA.

The code below computes exp() to nearly the desired accuracy of 1e-6. Use of FMA doesn't provide any significant improvement here, but it should provide a performance advantage on platforms which support it.

The algorithm used in a previous answer for a lower-precision SSE implementation is not completely extensible to a fairly accurate implementation, as it contains some computation with poor numerical properties which, however, does not matter in that context. Instead of computing e^x = 2^i * 2^f, with f in [0,1] or f in [-½, ½], it is advantageous to compute e^x = 2^i * e^f with f in the narrower interval [-½ log 2, ½ log 2], where log denotes the natural logarithm.

To do so, we first compute i = rint(x * log2(e)), then f = x - log(2) * i. Importantly, the latter computation needs to employ higher than native precision to deliver an accurate reduced argument to be passed to the core approximation. For this, we use a Cody-Waite scheme, first published in W. J. Cody & W. Waite, "Software Manual for the Elementary Functions", Prentice Hall 1980. The constant log(2) is split into a "high" portion of larger magnitude and a "low" portion of much smaller magnitude that holds the difference between the "high" portion and the mathematical constant.

The high portion is chosen with sufficient trailing zero bits in the mantissa, such that the product of i with the "high" portion is exactly representable in native precision. Here I have chosen a "high" portion with eight trailing zero bits, as i will certainly fit into eight bits.

In essence, we compute f = x - i * log(2)_hi - i * log(2)_lo. This reduced argument is passed into the core approximation, which is a polynomial minimax approximation, and the result is scaled by 2^i as in the previous answer.

#include <immintrin.h>

#define USE_FMA 0

/* compute exp(x) for x in [-87.33654f, 88.72283] 
   maximum relative error: 3.1575e-6 (USE_FMA = 0); 3.1533e-6 (USE_FMA = 1)
*/
__m256 faster_more_accurate_exp_avx2 (__m256 x)
{
    __m256 t, f, p, r;
    __m256i i, j;

    const __m256 l2e = _mm256_set1_ps (1.442695041f); /* log2(e) */
    const __m256 l2h = _mm256_set1_ps (-6.93145752e-1f); /* -log(2)_hi */
    const __m256 l2l = _mm256_set1_ps (-1.42860677e-6f); /* -log(2)_lo */
    /* coefficients for core approximation to exp() in [-log(2)/2, log(2)/2] */
    const __m256 c0 =  _mm256_set1_ps (0.041944388f);
    const __m256 c1 =  _mm256_set1_ps (0.168006673f);
    const __m256 c2 =  _mm256_set1_ps (0.499999940f);
    const __m256 c3 =  _mm256_set1_ps (0.999956906f);
    const __m256 c4 =  _mm256_set1_ps (0.999999642f);

    /* exp(x) = 2^i * e^f; i = rint (log2(e) * x), f = x - log(2) * i */
    t = _mm256_mul_ps (x, l2e);      /* t = log2(e) * x */
    r = _mm256_round_ps (t, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC); /* r = rint (t) */

#if USE_FMA
    f = _mm256_fmadd_ps (r, l2h, x); /* x - log(2)_hi * r */
    f = _mm256_fmadd_ps (r, l2l, f); /* f = x - log(2)_hi * r - log(2)_lo * r */
#else // USE_FMA
    p = _mm256_mul_ps (r, l2h);      /* log(2)_hi * r */
    f = _mm256_add_ps (x, p);        /* x - log(2)_hi * r */
    p = _mm256_mul_ps (r, l2l);      /* log(2)_lo * r */
    f = _mm256_add_ps (f, p);        /* f = x - log(2)_hi * r - log(2)_lo * r */
#endif // USE_FMA

    i = _mm256_cvtps_epi32(t);       /* i = (int)rint(t) */

    /* p ~= exp (f), -log(2)/2 <= f <= log(2)/2 */
    p = c0;                          /* c0 */
#if USE_FMA
    p = _mm256_fmadd_ps (p, f, c1);  /* c0*f+c1 */
    p = _mm256_fmadd_ps (p, f, c2);  /* (c0*f+c1)*f+c2 */
    p = _mm256_fmadd_ps (p, f, c3);  /* ((c0*f+c1)*f+c2)*f+c3 */
    p = _mm256_fmadd_ps (p, f, c4);  /* (((c0*f+c1)*f+c2)*f+c3)*f+c4 ~= exp(f) */
#else // USE_FMA
    p = _mm256_mul_ps (p, f);        /* c0*f */
    p = _mm256_add_ps (p, c1);       /* c0*f+c1 */
    p = _mm256_mul_ps (p, f);        /* (c0*f+c1)*f */
    p = _mm256_add_ps (p, c2);       /* (c0*f+c1)*f+c2 */
    p = _mm256_mul_ps (p, f);        /* ((c0*f+c1)*f+c2)*f */
    p = _mm256_add_ps (p, c3);       /* ((c0*f+c1)*f+c2)*f+c3 */
    p = _mm256_mul_ps (p, f);        /* (((c0*f+c1)*f+c2)*f+c3)*f */
    p = _mm256_add_ps (p, c4);       /* (((c0*f+c1)*f+c2)*f+c3)*f+c4 ~= exp(f) */
#endif // USE_FMA

    /* exp(x) = 2^i * p */
    j = _mm256_slli_epi32 (i, 23); /* i << 23 */
    r = _mm256_castsi256_ps (_mm256_add_epi32 (j, _mm256_castps_si256 (p))); /* r = p * 2^i */

    return r;
}

If higher accuracy is required, the degree of the polynomial approximation can be bumped up by one, using the following set of coefficients:

/* maximum relative error: 1.7428e-7 (USE_FMA = 0); 1.6586e-7 (USE_FMA = 1) */
const __m256 c0 =  _mm256_set1_ps (0.008301110f);
const __m256 c1 =  _mm256_set1_ps (0.041906696f);
const __m256 c2 =  _mm256_set1_ps (0.166674897f);
const __m256 c3 =  _mm256_set1_ps (0.499990642f);
const __m256 c4 =  _mm256_set1_ps (0.999999762f);
const __m256 c5 =  _mm256_set1_ps (1.000000000f);
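
With six coefficients the polynomial degree goes up by one, so the Horner evaluation gains one step. A minimal sketch of the FMA path with these coefficients (my extrapolation, not code from the original answer; the non-FMA path gains one extra mul/add pair analogously):

    p = c0;                          /* c0 */
    p = _mm256_fmadd_ps (p, f, c1);  /* c0*f+c1 */
    p = _mm256_fmadd_ps (p, f, c2);  /* (c0*f+c1)*f+c2 */
    p = _mm256_fmadd_ps (p, f, c3);
    p = _mm256_fmadd_ps (p, f, c4);
    p = _mm256_fmadd_ps (p, f, c5);  /* degree-5 polynomial ~= exp(f) */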
njuffa
  • That is quite an accurate approximation for such a low degree polynomial! With some experiments with the code in [my answer](https://stackoverflow.com/a/48869291/2439725), I found that the Cody-Waite trick is not always necessary when a high accuracy is not required, in particular when FMA is used to compute the reduced argument. With AVX2 only (no FMA), the second `exp256_ps(x)` function (at the bottom of my answer) has a maximum relative error of 4.1e-6, for all floats `x` in [-84,84]. With FMA enabled, by the compiler, the maximum relative error is 3.0e-7, for `x` in [-84,84]. – wim Mar 05 '18 at 13:18
  • @wim I used a minimax approximation, which may not be the case for approximations used in other codes. I concur that the use of FMA can sometimes make the use of the Cody-Waite trick unnecessary. I am still experimenting with an FMA-enhanced version of my code above. Initial indications are it doesn't seem to buy much of additional accuracy here. – njuffa Mar 05 '18 at 16:21
  • The `i = (int)r` comment doesn't match the code: `cvtps_epi32` is a round-to-nearest conversion, not C truncate-toward-zero (that would be [`cvttps_epi32`, with an extra t for truncate](https://github.com/HJLebbink/asm-dude/wiki/CVTTPS2DQ)). C doesn't have any nice way to express round and convert to `int` (`lrint` returns a `long`), but the most accurate way to represent the operation would be `(int) rint(r)`. Oh, I see `r` is the result of a `roundps`. Shorten that dep chain (but not the critical path) by using `_mm256_cvtps_epi32(t)` instead of `r`. – Peter Cordes Mar 06 '18 at 05:11
  • @PeterCordes I concur with your observations. I am not sure the minor increase in ILP causes a practical performance difference, since this is off the critical path as noted; I am also not sure about impact from potential difference in port usage from that change (I am not nearly as well versed with the intricate details of multiple generations of x86 CPU as you are :-) If performance data shows that use of `_mm256_cvtps_epi32(t)` is superior, I'd be more than happy to apply the change. – njuffa Mar 06 '18 at 05:22
  • It gives the CPU the option of running it earlier, sometime when there's a spare cycle on the same port(s) that handle FP add and round instructions. This may cause more or fewer resource conflicts (where the conversion steals a cycle from the critical path). Hmm, on Haswell/Skylake, `vroundps` is 2 uops (for p1/p01), so **converting to integer and back is the same latency and throughput as an actual `vroundps` on Haswell+**. Doing it that way gives you the rounded integer result for free! (And I think it's break even on Bulldozer / Ryzen, and Sandybridge) – Peter Cordes Mar 06 '18 at 05:31
3

I played a lot with this and discovered this one, which has a relative accuracy of about ~1e-07 and is simple to convert to vector instructions. With only 4 constants, 5 multiplications and 1 division, it is about twice as fast as the built-in exp() function.

float fast_exp(float x)
{
    const float c1 = 0.007972914726F;
    const float c2 = 0.1385283768F;
    const float c3 = 2.885390043F;
    const float c4 = 1.442695022F;      
    x *= c4; //convert to 2^(x)
    int intPart = (int)x;
    x -= intPart;
    float xx = x * x;
    float a = x + c1 * xx * x;
    float b = c3 + c2 * xx;
    float res = (b + a) / (b - a);
    reinterpret_cast<int &>(res) += intPart << 23; // res *= 2^(intPart)
    return res;
}

Converting to AVX (updated)

__m256 _mm256_exp_ps(__m256 _x)
{
    __m256 c1 = _mm256_set1_ps(0.007972914726F);
    __m256 c2 = _mm256_set1_ps(0.1385283768F);
    __m256 c3 = _mm256_set1_ps(2.885390043F);
    __m256 c4 = _mm256_set1_ps(1.442695022F);
    __m256 x = _mm256_mul_ps(_x, c4); //convert to 2^(x)
    __m256 intPartf = _mm256_round_ps(x, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
    x = _mm256_sub_ps(x, intPartf);
    __m256 xx = _mm256_mul_ps(x, x);
    __m256 a = _mm256_add_ps(x, _mm256_mul_ps(c1, _mm256_mul_ps(xx, x))); //can be improved with FMA
    __m256 b = _mm256_add_ps(c3, _mm256_mul_ps(c2, xx));
    __m256 res = _mm256_div_ps(_mm256_add_ps(b, a), _mm256_sub_ps(b, a));
    __m256i intPart = _mm256_cvtps_epi32(intPartf); //res = 2^intPart. Can be improved with AVX2!
    __m128i ii0 = _mm_slli_epi32(_mm256_castsi256_si128(intPart), 23);
    __m128i ii1 = _mm_slli_epi32(_mm256_extractf128_si256(intPart, 1), 23);     
    __m128i res_0 = _mm_add_epi32(ii0, _mm256_castsi256_si128(_mm256_castps_si256(res)));
    __m128i res_1 = _mm_add_epi32(ii1, _mm256_extractf128_si256(_mm256_castps_si256(res), 1));
    return _mm256_insertf128_ps(_mm256_castsi256_ps(_mm256_castsi128_si256(res_0)), _mm_castsi128_ps(res_1), 1);
}
jenkas
  • `_mm256_insertf128_ps(_mm256_setzero_ps(), _mm_castsi128_ps(res_0), 0)` is redundant; you don't need to zero-extend it by inserting into a zeroed vector when you're already about to insert a new high half. Just cast. And BTW, most modern CPUs also have AVX2, so you don't need to unpack to 128-bit at all, unless you need to run on Bulldozer-family and SnB/IvB. Haswell and later, and Ryzen (and even Excavator APUs) have AVX2. Or low-power Intel (Silvermont-family) doesn't even have AVX, nor do modern Pentium / Celeron chips. So yes, there are some AVX1-only CPUs around, but they are getting rarer. – Peter Cordes Feb 07 '20 at 21:20
  • You are right, but this code is for AVX1 instructions. I wrote in the code comments where it may be improved with AVX2 and FMA. – jenkas Feb 07 '20 at 21:24
  • The first part of my comment still applies to AVX1-only. You only need one `_mm256_insertf128_ps` (with immediate `1`). Your last line is overcomplicated. – Peter Cordes Feb 08 '20 at 06:46
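
For readers with AVX2, the four 128-bit lines at the end of _mm256_exp_ps() can indeed be collapsed into 256-bit integer operations, as the comments above suggest. An untested sketch (my extrapolation, not code from the answer) of the replacement for the final lines:

    __m256i intPart = _mm256_cvtps_epi32(intPartf);   /* intPartf is already integral */
    __m256i expBits = _mm256_slli_epi32(intPart, 23); /* intPart << 23 */
    return _mm256_castsi256_ps(
        _mm256_add_epi32(expBits, _mm256_castps_si256(res)));  /* res * 2^intPart */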
0

You can approximate the exponential yourself with a Taylor series:

exp(z) = 1 + z + pow(z,2)/2 + pow(z,3)/6 + pow(z,4)/24 + ...

For that you need only addition and multiplication operations from AVX. Coefficients like 1/2, 1/6, 1/24, etc. are faster if hard-coded as reciprocals and multiplied by, rather than divided by.

Take as many terms of the series as required by your precision. Note that you will get a relative error: for small z it may be 1e-6 in absolute terms, while for large z it will be more than 1e-6 in absolute terms, yet abs(E-E1)/abs(E) stays smaller than 1e-6 (where E is the precise exponential and E1 is what you get with the approximation).

UPDATE: As @Peter Cordes has mentioned in a comment, precision can be improved by separating exponentiation of integer and fractional parts, handling the integer part by manipulating the exponent field of the binary float representation (which is based on 2^x, not e^x). Then your Taylor series only has to minimize error over a small range.
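
As an illustration of that split, here is a minimal scalar sketch (not part of the original answer; the constants are float roundings of log2(e) and ln(2), and the degree-4 Taylor polynomial is illustrative only and too short for ~1e-6, so add terms as needed):

#include <math.h>
#include <stdint.h>
#include <string.h>

/* Scalar sketch: split x*log2(e) into an integer part i, handled via the
   exponent field, and a fraction, mapped back to base e and fed to a short
   Taylor polynomial 1 + z + z^2/2 + z^3/6 + z^4/24.
   No overflow/underflow handling. */
static float taylor_exp(float x)
{
    float t = x * 1.44269504f;          /* t = x * log2(e) */
    float i = rintf(t);                 /* integer part, goes into the exponent */
    float z = (t - i) * 0.693147181f;   /* reduced argument: z = x - i*ln(2) */
    float p = 1.0f + z * (1.0f + z * (0.5f + z * (1.0f/6 + z * (1.0f/24))));
    uint32_t bits;                      /* scale by 2^i via the exponent bits */
    memcpy(&bits, &p, sizeof bits);
    bits += (uint32_t)(int32_t)i << 23;
    memcpy(&p, &bits, sizeof p);
    return p;
}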

Serge Rogatch
  • In practice this will not be very efficient, unless z is very close to zero(!). Moreover, for negative values of z there might arise serious accuracy problems. Efficient and accurate exp approximations should use at least some form of range reduction. For function approximation, usually (rational) Chebyshev approximations are used instead of Taylor series, because they have better numerical properties. – wim Feb 19 '18 at 11:54
  • @wim, Any idea on implementation? – Royi Feb 19 '18 at 11:58
  • Serge, I know the Taylor approximation. But I think in practice some other tricks are used. Hence I wanted a code. Thank You. – Royi Feb 19 '18 at 11:59
  • @Royi Just try [avx_mathfun](https://github.com/reyoung/avx_mathfun) and see if it works for your application. Surprisingly, it uses range reduction and a simple Taylor approximation! – wim Feb 19 '18 at 12:23
  • How can you accurately represent the 1/N! when N is large enough? – Regis Portalez Feb 19 '18 at 12:27
  • @RegisPortalez, you will not get to a large N, because usually only the first few terms of the series are used for the approximation. Still, the floating-point approach allows representing numbers close to 0 well. – Serge Rogatch Feb 19 '18 at 12:34
  • @Royi I just took a close look at avx_mathfun. It doesn't use a Taylor polynomial, as expected. It uses some Chebyshev like polynomial approximation which tries to minimize the error across some interval. – wim Feb 19 '18 at 13:30
  • @wim, I said nothing about avx_mathfun. I'm OK with any method. The problem is it doesn't work on Visual Studio 2017. It seems no one is taking care of that code. – Royi Feb 19 '18 at 13:44
  • It won't even compile. Some issues with the macros (`#define...`) not properly set. – Royi Feb 19 '18 at 14:16
  • Instead of the macros to set the constants, just use standard `_mm256_set1_ps()` and `_mm256_set1_epi32()` etc. – wim Feb 19 '18 at 14:39
  • For example `cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);` – wim Feb 19 '18 at 14:40
  • For the Taylor expansion to not suck, you should only use it for the fractional part of the input. Take the integer part and stuff it into the exponent field of a `float` to get 2^x. (An extra multiply in there somewhere can account for 2^x vs. e^x, I think). With a polynomial for only the fractional part, you only have to minimax the error over a *much* smaller range. This is the same trick in reverse that you use for log(x): extract the exponent of the input to get log2(integer_part(x)). – Peter Cordes Feb 21 '18 at 16:00