15

I'm looking for an approximation of the natural exponential function operating on SSE elements, namely __m128 exp(__m128 x).

I have an implementation which is quick but seems to have very low accuracy:

static inline __m128 FastExpSse(__m128 x)
{
    __m128 a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
    __m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
    __m128  m87 = _mm_set1_ps(-87.0f);
    // fast exponential function, x should be in [-87, 87]
    __m128 mask = _mm_cmpge_ps(x, m87);  // flush result to zero for x < -87

    __m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}

Does anybody have an implementation with better accuracy that is just as fast (or faster)?

I'd be happy if it were written in C style.

Thank You.

Roland
Royi
  • What is the accuracy of your current implementation (maximum relative error)? And what accuracy are you targeting for an improved version? – njuffa Oct 30 '17 at 22:55
  • It is really bad (I'd even suspect there is an error, if someone could spot it). So probably anything which makes sense will beat it. Something below 1% would be great! – Royi Oct 30 '17 at 22:56
  • See: [sse_mathfun](https://github.com/RJVB/sse_mathfun) and [this answer](https://stackoverflow.com/a/8907932/253056) (relates to `log2` but most of the suggestions also include an `exp` function). – Paul R Oct 31 '17 at 07:41
  • @PaulR, How different is it from njuffa's solution? – Royi Oct 31 '17 at 09:16
  • @Royi: there are several different options in the linked answer, but in general they are probably somewhat more accurate and somewhat slower. So it depends on whereabouts on the accuracy-versus-performance curve you want to be. – Paul R Oct 31 '17 at 09:23
  • Related: [Fastest Implementation of Exponential Function Using AVX](https://stackoverflow.com/questions/48863719/fastest-implementation-of-exponential-function-using-avx) has an alternate implementation with more polynomial terms. – Peter Cordes Feb 21 '18 at 20:45
  • Why did they close the other one? They are certainly not "Duplicate". – Royi Feb 21 '18 at 22:42

5 Answers

25

The C code below is a translation into SSE intrinsics of an algorithm I used in a previous answer to a similar question.

The basic idea is to transform the computation of the standard exponential function into the computation of a power of 2: expf (x) = exp2f (x / logf (2.0f)) = exp2f (x * 1.44269504). We split t = x * 1.44269504 into an integer i and a fraction f, such that t = i + f and 0 <= f <= 1. We can now compute 2^f with a polynomial approximation, then scale the result by 2^i by adding i to the exponent field of the single-precision floating-point result.

One problem with an SSE implementation is that we want to compute i = floorf (t), but there is no fast way to compute the floor() function. However, we observe that floor(x) == trunc(x) for positive numbers, and floor(x) == trunc(x) - 1 for negative numbers, except when x is a negative integer. Since the core approximation can handle an f value of 1.0f, using the approximation for negative arguments is harmless. SSE provides an instruction to convert single-precision floating-point operands to integers with truncation, so this solution is efficient.

Peter Cordes points out that SSE4.1 supports a fast floor function _mm_floor_ps(), so a variant using SSE4.1 is also shown below. Not all toolchains automatically predefine the macro __SSE4_1__ when SSE 4.1 code generation is enabled, but gcc does.

Compiler Explorer (Godbolt) shows that gcc 7.2 compiles the code below into sixteen instructions for plain SSE and twelve instructions for SSE 4.1.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <emmintrin.h>
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif

/* max. rel. error = 1.72863156e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
    __m128 t, f, e, p, r;
    __m128i i, j;
    __m128 l2e = _mm_set1_ps (1.442695041f);  /* log2(e) */
    __m128 c0  = _mm_set1_ps (0.3371894346f);
    __m128 c1  = _mm_set1_ps (0.657636276f);
    __m128 c2  = _mm_set1_ps (1.00172476f);

    /* exp(x) = 2^i * 2^f; i = floor (log2(e) * x), 0 <= f <= 1 */   
    t = _mm_mul_ps (x, l2e);             /* t = log2(e) * x */
#ifdef __SSE4_1__
    e = _mm_floor_ps (t);                /* floor(t) */
    i = _mm_cvtps_epi32 (e);             /* (int)floor(t) */
#else /* __SSE4_1__*/
    i = _mm_cvttps_epi32 (t);            /* i = (int)t */
    j = _mm_srli_epi32 (_mm_castps_si128 (x), 31); /* signbit(x) == signbit(t), since l2e > 0 */
    i = _mm_sub_epi32 (i, j);            /* (int)t - signbit(t) */
    e = _mm_cvtepi32_ps (i);             /* floor(t) ~= (int)t - signbit(t) */
#endif /* __SSE4_1__*/
    f = _mm_sub_ps (t, e);               /* f = t - floor(t) */
    p = c0;                              /* c0 */
    p = _mm_mul_ps (p, f);               /* c0 * f */
    p = _mm_add_ps (p, c1);              /* c0 * f + c1 */
    p = _mm_mul_ps (p, f);               /* (c0 * f + c1) * f */
    p = _mm_add_ps (p, c2);              /* p = (c0 * f + c1) * f + c2 ~= 2^f */
    j = _mm_slli_epi32 (i, 23);          /* i << 23 */
    r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i */
    return r;
}

int main (void)
{
    union {
        float f[4];
        unsigned int i[4];
    } arg, res;
    double relerr, maxrelerr = 0.0;
    int i, j;
    __m128 x, y;

    float start[2] = {-0.0f, 0.0f};
    float finish[2] = {-87.33654f, 88.72283f};

    for (i = 0; i < 2; i++) {

        arg.f[0] = start[i];
        arg.i[1] = arg.i[0] + 1;
        arg.i[2] = arg.i[0] + 2;
        arg.i[3] = arg.i[0] + 3;
        do {
            memcpy (&x, &arg, sizeof(x));
            y = fast_exp_sse (x);
            memcpy (&res, &y, sizeof(y));
            for (j = 0; j < 4; j++) {
                double ref = exp ((double)arg.f[j]);
                relerr = fabs ((res.f[j] - ref) / ref);
                if (relerr > maxrelerr) {
                    printf ("arg=% 15.8e  res=%15.8e  ref=%15.8e  err=%15.8e\n", 
                            arg.f[j], res.f[j], ref, relerr);
                    maxrelerr = relerr;
                }
            }   
            arg.i[0] += 4;
            arg.i[1] += 4;
            arg.i[2] += 4;
            arg.i[3] += 4;
        } while (fabsf (arg.f[3]) < fabsf (finish[i]));
    }
    printf ("maximum relative errror = %15.8e\n", maxrelerr);
    return EXIT_SUCCESS;
}

An alternative design for fast_exp_sse() extracts the integer portion of the adjusted argument x / log(2) in round-to-nearest mode, using the well-known technique of adding the "magic" conversion constant 1.5 * 2^23 to force rounding in the correct bit position, then subtracting out the same number again. This requires that the SSE rounding mode in effect during the addition is "round to nearest or even", which is the default. wim pointed out in comments that some compilers may optimize out the addition and subtraction of the conversion constant cvt as redundant when aggressive optimization is used, interfering with the functionality of this code sequence, so it is recommended to inspect the generated machine code. The approximation interval for the computation of 2^f is now centered around zero, since -0.5 <= f <= 0.5, requiring a different core approximation.

/* max. rel. error <= 1.72860465e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
    __m128 t, f, p, r;
    __m128i i, j;

    const __m128 l2e = _mm_set1_ps (1.442695041f); /* log2(e) */
    const __m128 cvt = _mm_set1_ps (12582912.0f);  /* 1.5 * (1 << 23) */
    const __m128 c0 =  _mm_set1_ps (0.238428936f);
    const __m128 c1 =  _mm_set1_ps (0.703448006f);
    const __m128 c2 =  _mm_set1_ps (1.000443142f);

    /* exp(x) = 2^i * 2^f; i = rint (log2(e) * x), -0.5 <= f <= 0.5 */
    t = _mm_mul_ps (x, l2e);             /* t = log2(e) * x */
    r = _mm_sub_ps (_mm_add_ps (t, cvt), cvt); /* r = rint (t) */
    f = _mm_sub_ps (t, r);               /* f = t - rint (t) */
    i = _mm_cvtps_epi32 (t);             /* i = (int)rint(t) */
    p = c0;                              /* c0 */
    p = _mm_mul_ps (p, f);               /* c0 * f */
    p = _mm_add_ps (p, c1);              /* c0 * f + c1 */
    p = _mm_mul_ps (p, f);               /* (c0 * f + c1) * f */
    p = _mm_add_ps (p, c2);              /* p = (c0 * f + c1) * f + c2 ~= exp2(f) */
    j = _mm_slli_epi32 (i, 23);          /* i << 23 */
    r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i */
    return r;
}

The algorithm for the code in the question appears to be taken from the work of Nicol N. Schraudolph, which cleverly exploits the semi-logarithmic nature of IEEE-754 binary floating-point formats:

N. N. Schraudolph. "A fast, compact approximation of the exponential function." Neural Computation, 11(4), May 1999, pp. 853-862.

After removal of the argument clamping code, it reduces to just three SSE instructions. The "magical" correction constant 486411 is not optimal for minimizing maximum relative error over the entire input domain. Based on simple binary search, the value 298765 seems to be superior, reducing maximum relative error for FastExpSse() to 3.56e-2 vs. maximum relative error of 1.73e-3 for fast_exp_sse().

/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
__m128 FastExpSse (__m128 x)
{
    __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
    __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
    return _mm_castsi128_ps (t);
}

Schraudolph's algorithm basically uses the linear approximation 2^f ~= 1.0 + f for f in [0,1], and its accuracy could be improved by adding a quadratic term. The clever part of Schraudolph's approach is computing 2^i * 2^f without explicitly separating the integer portion i = floor(x * 1.44269504) from the fraction. I see no way to extend that trick to a quadratic approximation, but one can certainly combine the floor() computation from Schraudolph with the quadratic approximation used above:

/* max. rel. error <= 1.72886892e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
    __m128 f, p, r;
    __m128i t, j;
    const __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
    const __m128i m = _mm_set1_epi32 (0xff800000); /* mask for integer bits */
    const __m128 ttm23 = _mm_set1_ps (1.1920929e-7f); /* exp2(-23) */
    const __m128 c0 = _mm_set1_ps (0.3371894346f);
    const __m128 c1 = _mm_set1_ps (0.657636276f);
    const __m128 c2 = _mm_set1_ps (1.00172476f);

    t = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
    j = _mm_and_si128 (t, m);            /* j = (int)(floor (x/log(2))) << 23 */
    t = _mm_sub_epi32 (t, j);
    f = _mm_mul_ps (ttm23, _mm_cvtepi32_ps (t)); /* f = (x/log(2)) - floor (x/log(2)) */
    p = c0;                              /* c0 */
    p = _mm_mul_ps (p, f);               /* c0 * f */
    p = _mm_add_ps (p, c1);              /* c0 * f + c1 */
    p = _mm_mul_ps (p, f);               /* (c0 * f + c1) * f */
    p = _mm_add_ps (p, c2);              /* p = (c0 * f + c1) * f + c2 ~= 2^f */
    r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i */
    return r;
}
Roland
njuffa
  • @Royi Note that I posted my test framework along with `fast_exp_sse()` itself, so you should be able to verify my accuracy claim and test your existing function as well, for a side-by-side comparison. – njuffa Oct 30 '17 at 23:27
  • For code-style, I'd highly recommend `_mm_set1_ps()` for the constant vectors in the first code block as well. Initializing a `__m128` with a braced initializer is not even guaranteed to be portable, but I think it does work on gcc/clang and I assume MSVC. But repeating each constant 4 times in the source is not nice. – Peter Cordes Nov 01 '17 at 11:20
  • SSE4.1 `roundps` provides a fast [`_mm_floor_ps ()`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5450,4247,3115,3856,4002,1365,2294,1724,3978,1365,1370,1439,2386&techs=SSE2,SSE4_1&text=floor), so you could [`#ifdef __SSE4_1__`](https://stackoverflow.com/questions/28939652/how-to-detect-sse-avx-avx2-avx-512-availability-at-compile-time) that part to use a more efficient `floor` when it's enabled at compile time. (MSVC doesn't define that macro, so you need other checks there...) – Peter Cordes Nov 01 '17 at 11:26
  • On a preliminary test on my machine, it seems your code is even faster than the code in the question. – Royi Nov 01 '17 at 12:26
  • @PeterCordes, could you post your code which is optimized for SSE4? Thank You. – Royi Nov 01 '17 at 12:42
  • @PeterCordes Thanks for the tip about `_mm_set1_ps()`, I will adjust the code. I last used SSE intrinsics ten years ago (before switching to work with CUDA on GPUs) so I am a bit rusty. I used the Intel compiler version 13 when creating this code. – njuffa Nov 01 '17 at 14:26
  • @Royi: the variable names aren't exactly easy to search for to see what's needed later and what isn't. Based on the text and comments, I think you can just replace the truncate / subtract-sign-bit / convert back to float stuff with `__m128 e = _mm_floor_ps(t);` Then convert to `__m128i` with rounding or truncation, doesn't matter because the values are already integers. (BTW, `roundps` takes the rounding mode as an immediate operand. `_mm_floor_ps` passes the right constant to get round toward -Inf, and gives you an IEEE `floorf` for each element.) – Peter Cordes Nov 01 '17 at 15:19
  • AVX512 has some cool stuff for messing around with exponent/mantissa. e.g. extract the exponent field *as a float* with `vgetexp`. I think there's something that scales by a power of 2, but I forget (so many new instructions). AVX512ER (Xeon Phi only) actually has a built-in `vexp2ps` instruction that gives you an approximate 2^x with less than 2^-23 relative error. But for Skylake-AVX512, something like this would still be useful. – Peter Cordes Nov 01 '17 at 15:26
  • @PeterCordes Note that my code requires `floor(t)` both as an integer `i` (to be added to the exponent field later) and a floating-point number `e` (for the computation of the "fraction" `f`). So that would give us: `e = _mm_floor_ps (t); i = _mm_cvtps_epi32 (e);`. – njuffa Nov 01 '17 at 15:44
  • @njuffa, What if more accuracy is needed? Do you understand what, for instance, Agner Fog did in his implementation (which is only ~10% slower yet much more accurate)? Thank You. – Royi Feb 19 '18 at 00:36
  • @Royi I tend not to look at other people's code, and while I have read Mr. Fog's optimization guides in the past, I have never looked at his code. More accuracy can be achieved by using a polynomial of higher degree, and by using FMA (fused multiply-add) where available. That can easily get you to a faithfully-rounded implementation. As for performance, as opposed to Mr. Fog I am not an expert SSE programmer: I switched to CUDA/GPUs a decade ago. – njuffa Feb 19 '18 at 15:20
  • Great code! (already upvoted long ago :) ) Note that one has to be careful with selecting the compiler options. With some compilers, the addition and subtraction of `cvt = _mm_set1_ps (12582912.0f);` might get optimized away with the -Ofast option, for example. – wim Feb 21 '18 at 10:38
  • Wow, this is great! What changes would be needed to get a "double" version of `FastExpSse`? I mean `__m128d FastExpSse (__m128d x)`. – ThreeStarProgrammer57 Jun 04 '18 at 15:47
  • @ThreeStarProgrammer57 The easiest way to find out would be to ask a new question. Before you do, make sure there isn't already an existing Q&A for it. If you decide to go ahead with a new question, don't forget to state your accuracy requirements as that will likely influence the design. – njuffa Jun 04 '18 at 17:47
  • @njuffa Thanks, I went ahead with this question: https://stackoverflow.com/questions/50697711/fast-sse-low-precision-exponential-using-double-precision-operations – ThreeStarProgrammer57 Jun 05 '18 at 12:43
  • @njuffa, Any chance you have a look at - https://codereview.stackexchange.com/questions/203356? Thank You. – Royi Sep 08 '18 at 18:19
  • @Royi I am not on Code Review. Use the profiler (VTune) to guide your optimization efforts. The code looks at least partially memory-bound to me. – njuffa Sep 08 '18 at 19:33
  • I thought that once you have an ID on SE you have one for all of them. Do you have any thoughts about the cause of the extra overhead when combining Gaussian Blur with the other code? – Royi Sep 08 '18 at 19:41
  • I used the last portion of code from the post but it created a bug, where a random index of x would receive a NaN value. I modified it to use __m256. With `std::cout` I found that it seems to occur because of `t = _mm_cvtps_epi32()`, whenever the argument is negative. It only occurs in Release mode. I am using Visual Studio 2019, i7 Devil's Canyon. I decided to clamp x at the start of the function, but NaNs still kept appearing: `x = _mm256_max_ps(x, _mm256_set1_ps(-50.0f));` and `x = _mm256_min_ps(x, _mm256_set1_ps(50.0f));`. This issue doesn't happen with `BetterFastExpSse()` by @NicSchraudolph – Kari Jun 28 '19 at 11:15
  • If AVX is available, `_mm_broadcast_ss` seems to be faster than `_mm_set1_ps`, since it is not an instruction sequence – Iris Technologies Jun 21 '20 at 14:07
9

A good increase in accuracy in my algorithm (the FastExpSse implementation in the answer above) can be obtained at the cost of an integer subtraction and a floating-point division, by using FastExpSse(x/2)/FastExpSse(-x/2) instead of FastExpSse(x). The trick here is to set the shift parameter (298765 above) to zero so that the piecewise linear approximations in the numerator and denominator line up to give you substantial error cancellation. Roll it into a single function:

__m128 BetterFastExpSse (__m128 x)
{
  const __m128 a = _mm_set1_ps ((1 << 22) / (float)M_LN2);  // to get exp(x/2); M_LN2 from <math.h>
  const __m128i b = _mm_set1_epi32 (127 * (1 << 23));       // NB: zero shift!
  __m128i r = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
  __m128i s = _mm_add_epi32 (b, r);
  __m128i t = _mm_sub_epi32 (b, r);
  return _mm_div_ps (_mm_castsi128_ps (s), _mm_castsi128_ps (t));
}

(I'm not a hardware guy - how bad a performance killer is the division here?)

If you need exp(x) just to get y = tanh(x) (e.g. for neural networks), use FastExpSse with zero shift as follows:

a = FastExpSse(x);
b = FastExpSse(-x);
y = (a - b)/(a + b);

to get the same type of error cancellation benefit. The logistic function works similarly, using FastExpSse(x/2)/(FastExpSse(x/2) + FastExpSse(-x/2)) with zero shift. (This is just to show the principle - you obviously don't want to evaluate FastExpSse multiple times here, but roll it into a single function along the lines of BetterFastExpSse above.)
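Rolled into one function along those lines, a minimal sketch for tanh might look like this (FastTanhSse is a hypothetical name; untested, using the same constants as above):

__m128 FastTanhSse (__m128 x)  // hypothetical sketch, zero-shift trick as above
{
  const __m128 a = _mm_set1_ps (12102203.0f);          // (1 << 23) / log(2)
  const __m128i b = _mm_set1_epi32 (127 * (1 << 23));  // NB: zero shift!
  __m128i r = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
  __m128 p = _mm_castsi128_ps (_mm_add_epi32 (b, r));  // ~ exp(x)
  __m128 q = _mm_castsi128_ps (_mm_sub_epi32 (b, r));  // ~ exp(-x)
  return _mm_div_ps (_mm_sub_ps (p, q), _mm_add_ps (p, q)); // (a - b)/(a + b)
}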

I did develop a series of higher-order approximations from this, ever more accurate but also slower. Unpublished but happy to collaborate if anyone wants to give them a spin.

And finally, for some fun: use it in reverse gear to get FastLogSse. Chaining that with FastExpSse gives you both the power operator and error cancellation, and out pops a blazingly fast power function...
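A sketch of the reverse-gear idea (FastLogSse is a hypothetical name, shown here with the same constants as njuffa's FastExpSse() above; the float's bit pattern is reinterpreted as an integer and the scaling is undone):

__m128 FastLogSse (__m128 x)  // hypothetical sketch: FastExpSse run backwards
{
  const __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
  const __m128 c = _mm_set1_ps (8.262958405e-8f);  // log(2) / (1 << 23)
  __m128i t = _mm_sub_epi32 (_mm_castps_si128 (x), b);
  return _mm_mul_ps (_mm_cvtepi32_ps (t), c);
}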

Nic Schraudolph
  • [One division mixed in with a lot of multiplies is the best case](https://stackoverflow.com/questions/4125033/floating-point-division-vs-floating-point-multiplication/45899202#45899202). `divps` has worse latency than a multiply, but starting a division doesn't block the multiply / FMA units, so if you're doing multiple exp() calls on independent data, it's going to be about as cheap as another multiply. (And thus much better than using a large polynomial or double-precision or w/e.) The main cost here is from evaluating the same polynomial twice, not the division at the end, on modern x86. – Peter Cordes May 16 '18 at 22:04
  • @technosaurus I've removed `auto`. This is generic code, I wasn't meaning to imply any particular language. – Nic Schraudolph May 17 '18 at 17:23
  • @Peter no need to evaluate a polynomial twice - since the argument is merely negated, it's just an extra integer subtraction. I've expanded the code snippet to clarify. Good news about the division! – Nic Schraudolph May 17 '18 at 18:41
  • Oh, I hadn't realized how simple the rest of the code was. This probably will bottleneck on `divps` throughput on most CPUs, especially a 256-bit AVX version, if you're doing *just* this over an array. But to feed another calculation it might still be fine. @njuffa's 2nd version `fast_exp_sse`with relative error `1.72886892e-3` might be faster, especially on Haswell (where FMA is available for very fast polylnomials, but `divps` isn't as fast as Skylake). – Peter Cordes May 17 '18 at 19:31
  • Nice. For completeness: After changing the last line of `BetterFastExpSse()` to `return _mm_div_ps (_mm_castsi128_ps (s), _mm_castsi128_ps (t));` I observed a maximum relative error of 1.049e-2 for arguments in [-87.33654, 88.72283]. – njuffa May 18 '18 at 04:26
  • @Peter okay - so I guess this is mostly of interest for neural network applications, where the division is needed anyway (and can be co-opted here) to get a logistic or tanh function. – Nic Schraudolph May 18 '18 at 13:11
  • That or cases where FastLog is only part of what you do for each element. That's not rare. – Peter Cordes May 18 '18 at 16:02
  • @NicSchraudolph, Any chance you have a look at https://codereview.stackexchange.com/questions/203356 ? Thank You. – Royi Sep 08 '18 at 18:20
5

Going back through my notes from way back then, I did explore ways to improve the accuracy without using division. I used the same reinterpret-as-float trick but applied a polynomial correction to the mantissa which was essentially calculated in 16-bit fixed-point arithmetic (the only way to do it fast back then).

The cubic and quartic versions give you 4 and 5 significant digits of accuracy, respectively. There was no point in increasing the order beyond that, as the noise of the low-precision arithmetic then starts to drown out the error of the polynomial approximation. Here are the plain C versions:

#include <stdint.h>

float fastExp3(register float x)  // cubic spline approximation
{
    union { float f; int32_t i; } reinterpreter;

    reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
    int32_t m = (reinterpreter.i >> 7) & 0xFFFF;  // copy mantissa
    // empirical values for small maximum relative error (8.34e-5):
    reinterpreter.i +=
         ((((((((1277*m) >> 14) + 14825)*m) >> 14) - 79749)*m) >> 11) - 626;
    return reinterpreter.f;
}

float fastExp4(register float x)  // quartic spline approximation
{
    union { float f; int32_t i; } reinterpreter;

    reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
    int32_t m = (reinterpreter.i >> 7) & 0xFFFF;  // copy mantissa
    // empirical values for small maximum relative error (1.21e-5):
    reinterpreter.i += (((((((((((3537*m) >> 16)
        + 13668)*m) >> 18) + 15817)*m) >> 14) - 80470)*m) >> 11);
    return reinterpreter.f;
}

The quartic one satisfies fastExp4(0.0f) == 1.0f, which can be important for fixed-point iteration algorithms.

How efficient are these integer multiply-shift-add sequences in SSE? On architectures where float arithmetic is just as fast, one could use that instead, reducing the arithmetic noise. This would essentially yield cubic and quartic extensions of @njuffa's answer above.
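For what it's worth, a manual vectorization of fastExp3() seems straightforward when SSE4.1 is available, since _mm_mullo_epi32 supplies the required 32-bit integer multiply; the following is only an untested sketch under that assumption (fastExp3Sse is a hypothetical name):

#include <smmintrin.h>  // SSE4.1, for _mm_mullo_epi32

__m128 fastExp3Sse (__m128 x)  // hypothetical sketch: fastExp3() with SSE4.1
{
    __m128i i = _mm_cvttps_epi32 (_mm_mul_ps (x, _mm_set1_ps (12102203.0f)));
    i = _mm_add_epi32 (i, _mm_set1_epi32 (127 * (1 << 23)));
    // i > 0 for in-range x, so a logical shift extracts the mantissa bits
    __m128i m = _mm_and_si128 (_mm_srli_epi32 (i, 7), _mm_set1_epi32 (0xFFFF));
    // same multiply-shift-add sequence as the scalar code, per element
    __m128i p = _mm_srai_epi32 (_mm_mullo_epi32 (_mm_set1_epi32 (1277), m), 14);
    p = _mm_srai_epi32 (_mm_mullo_epi32 (_mm_add_epi32 (p, _mm_set1_epi32 (14825)), m), 14);
    p = _mm_srai_epi32 (_mm_mullo_epi32 (_mm_sub_epi32 (p, _mm_set1_epi32 (79749)), m), 11);
    i = _mm_add_epi32 (i, _mm_sub_epi32 (p, _mm_set1_epi32 (626)));
    return _mm_castsi128_ps (i);
}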

Nic Schraudolph
  • `M_LN2` is a `double`, so the resulting asm has to convert float to double. With `(float)((1 << 23)/M_LN2)`, gcc7.3 auto-vectorizes these pretty reasonably in a loop (https://godbolt.org/g/vEWofG), with SSE4.1 or AVX2 (`vcvttps2dq` packed FP->int conversion, `vpmulld` packed 32-bit multiply / `vpsrad` packed arithmetic right shift), but clang doesn't autovectorize at all; it still converts to scalar. So you'd need to manually vectorize for this to be good with clang, I guess. – Peter Cordes May 19 '18 at 19:34
  • I haven't looked at the speed vs. precision tradeoff or this vs. others, but `vpmulld` has one per 2 clock throughput on Haswell (1/4 the throughput of FP mul or FMA). It's still 2 uops on Skylake, but SKL can run it on p0 / p1 for a throughput of one per clock. But it takes up more execution resources. This might be good on Nehalem or maybe Sandybridge where 128-bit `pmulld` is 1 uop, and FP division is more expensive. Static analysis of these with IACA ([What is IACA and how do I use it?](https://stackoverflow.com/q/26021337)) would be interesting and not difficult – Peter Cordes May 19 '18 at 19:38
  • @Royi sorry I don't know enough SSE/AVX for that. – Nic Schraudolph May 20 '18 at 15:28
  • After *replacing* `(1 << 23)/M_LN2` with `12102203.0f`, an exhaustive test of all IEEE-754 `binary32` floating-point numbers in [-87.33654, 88.72283] shows: max. rel. error `fastExp3()` = 8.34e-5; max. rel. error `fastExp4()` = 1.21e-5 – njuffa May 22 '18 at 16:53
  • @njuffa I suspect the higher errors you find are due to the inaccurate constant - can you try the exact value `12102203.161561f`? I've edited my answer to use that, in order to avoid the problems with `(1 << 23)/M_LN2` that you and @Peter describe. Cheers – Nic Schraudolph May 23 '18 at 12:41
  • @Nic Schraudolph The original constant `(1 << 23)/M_LN2` is a *double-precision* constant, which is undesirable here from a performance perspective. I switched the code to the closest *single-precision* constant, and this will naturally increase the overall error. The error bounds I stated were meant to provide additional information (in the context of this change), *not* challenge the bounds stated in the code comments (which are based on the use of double-precision computation). – njuffa May 23 '18 at 14:19
  • In other words, as `binary32`: `12102203.161561f == 12102203.0f == 0x1.715476p+23` – njuffa May 23 '18 at 14:25
  • @njuffa gotcha. I thought I had checked for the nearest float but was fooled by gcc interpreting `printf("%f\n", (1 << 23)/M_LN2);` to print a double instead of implicitly casting to float as I had assumed. D'oh! Fixed the code to include your constant and error bounds, as float makes more sense in this context. – Nic Schraudolph May 23 '18 at 23:31
  • @NicSchraudolph: `printf` can't print a `float` anyway: variadic functions use the implicit arg promotions, which includes promoting float to double. `"%f"` is the printf conversion for a `double`. But yeah `(float)((1 << 23)/M_LN2)` would have rounded to float first and *then* promoted that to double, instead of going directly from integer to double without rounding to float first. – Peter Cordes Jun 23 '19 at 14:51
1

There is a paper about creating fast versions of these functions (tanh, cosh, artanh, sinh, etc.):

http://ijeais.org/wp-content/uploads/2018/07/IJAER180702.pdf "Creating a Compiler Optimized Inlineable Implementation of Intel Svml Simd Intrinsics"

Their tanh equation 6, on page 9, is very similar to @NicSchraudolph's answer.

Kari
  • Normally link-only answers aren't welcome on SO. This arguably belongs as a comment, but I'm not going to flag it because it's potentially useful enough. – Peter Cordes Jun 23 '19 at 14:53
1

For softmax use, I'm envisioning the flow as:

auto a = _mm_mul_ps(x, _mm_set1_ps(12102203.2f));
auto b = _mm_castsi128_ps(_mm_cvtps_epi32(a)); // same as in the other variants so far

// copy 9 MSB from 0x3f800000 over 'b' so that 1 <= c < 2
//  - also  1 <= poly_eval(...) < 2
auto c = replace_exponent(b, _mm_set1_ps(1.0f));
auto d = poly_eval(c, kA, kB, kC);  // 2nd degree polynomial
auto e = replace_exponent(d, b);    // restore exponent : 2^i * 2^f

The exponent copying can be done as a bitwise select using a proper mask (AVX-512 has vpternlogd, and I'm actually using Arm Neon's vbsl).

All input values x must be negative and clamped to -17-f(N) <= x <= -f(N), so that when scaled by (1 << 23)/log(2), the maximum sum of the N resulting floating-point values does not reach infinity and the reciprocal does not become denormal. For N=3, f(N) = 4. A larger f(N) will trade off input precision.

The poly_eval coefficients are generated for example by polyfit([1 1.5 2], [1 sqrt(2) 2]), giving kA=0.343146, kB=-0.029437, kC=0.68292, producing values strictly smaller than 2 and preventing discontinuities. The maximum average error can be reduced by evaluating the polynomial at x=[1+max_err 1.5-eps 2], y=[1 2^(.5-eps) 2-max_err].

For plain SSE/AVX, the exponent replacement for 1.0f can be done as (x & 0x007fffff) | 0x3f800000. A two-instruction sequence for the latter exponent replacement can be found by ensuring that poly_eval(x) evaluates to a range that can be directly ORed with b & 0xff800000.
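For concreteness, a plain SSE2 sketch of the two placeholder helpers used above might look as follows (replace_exponent and poly_eval are this answer's hypothetical names, not library functions):

#include <emmintrin.h>

static inline __m128 replace_exponent(__m128 a, __m128 e)
{
    // mantissa bits from a, sign/exponent bits from e (bitwise select)
    const __m128i mant = _mm_set1_epi32(0x007fffff);
    return _mm_castsi128_ps(_mm_or_si128(
        _mm_and_si128(_mm_castps_si128(a), mant),
        _mm_andnot_si128(mant, _mm_castps_si128(e))));
}

static inline __m128 poly_eval(__m128 x, __m128 kA, __m128 kB, __m128 kC)
{
    // kA*x^2 + kB*x + kC, evaluated via Horner's rule
    return _mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(kA, x), kB), x), kC);
}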

Aki Suihkonen