2

I'm writing code that essentially takes advantage of SSE2 to optimize this code:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    pC[sampleIndex] = exp((mMin + std::clamp(pA[sampleIndex] + pB[sampleIndex], 0.0, 1.0) * mRange) * ln2per12);
}

into this:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];

// SSE2
__m128d bound_lower = _mm_set1_pd(0.0);
__m128d bound_upper = _mm_set1_pd(1.0);
__m128d rangeLn2per12 = _mm_set1_pd(mRange * ln2per12);
__m128d minLn2per12 = _mm_set1_pd(mMin * ln2per12);

__m128d loaded_a = _mm_load_pd(pA);
__m128d loaded_b = _mm_load_pd(pB);
__m128d result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);

// Unrolled 4x (8 doubles per iteration) and software-pipelined: each
// iteration first stores the result computed at the end of the previous one.
double *pCEnd = pC + roundintup8(blockSize);
for (; pC < pCEnd; pA += 8, pB += 8, pC += 8) {
    _mm_store_pd(pC, result);

    loaded_a = _mm_load_pd(pA + 2);
    loaded_b = _mm_load_pd(pB + 2);
    result = _mm_add_pd(loaded_a, loaded_b);
    result = _mm_max_pd(bound_lower, result);
    result = _mm_min_pd(bound_upper, result);
    result = _mm_mul_pd(rangeLn2per12, result);
    result = _mm_add_pd(minLn2per12, result);
    _mm_store_pd(pC + 2, result);

    loaded_a = _mm_load_pd(pA + 4);
    loaded_b = _mm_load_pd(pB + 4);
    result = _mm_add_pd(loaded_a, loaded_b);
    result = _mm_max_pd(bound_lower, result);
    result = _mm_min_pd(bound_upper, result);
    result = _mm_mul_pd(rangeLn2per12, result);
    result = _mm_add_pd(minLn2per12, result);
    _mm_store_pd(pC + 4, result);

    loaded_a = _mm_load_pd(pA + 6);
    loaded_b = _mm_load_pd(pB + 6);
    result = _mm_add_pd(loaded_a, loaded_b);
    result = _mm_max_pd(bound_lower, result);
    result = _mm_min_pd(bound_upper, result);
    result = _mm_mul_pd(rangeLn2per12, result);
    result = _mm_add_pd(minLn2per12, result);
    _mm_store_pd(pC + 6, result);

    loaded_a = _mm_load_pd(pA + 8);
    loaded_b = _mm_load_pd(pB + 8);
    result = _mm_add_pd(loaded_a, loaded_b);
    result = _mm_max_pd(bound_lower, result);
    result = _mm_min_pd(bound_upper, result);
    result = _mm_mul_pd(rangeLn2per12, result);
    result = _mm_add_pd(minLn2per12, result);
}

And I would say it works pretty well. BUT, I can't find any exp function for SSE2 to complete the chain of operations.

Reading this, it seems I need to call the standard exp() from the library?

Really? Isn't that penalizing? Are there any other ways? A different builtin function?

I'm on MSVC, /arch:SSE2, /O2, producing 32-bit code.

markzzz
  • How accurate does it need to be? There are, of course, various hacks with various trade-offs in accuracy/time – harold Dec 20 '18 at 16:13
  • Maybe an approximation of the exponent will work? – Dmytro Dadyka Dec 20 '18 at 16:18
  • Also https://stackoverflow.com/questions/47025373/fastest-implementation-of-exponential-function-using-sse – Dmytro Dadyka Dec 20 '18 at 16:33
  • @Matthieu Brucher: what about this? https://software.intel.com/en-us/ipp-dev-reference-exp – markzzz Dec 20 '18 at 16:36
  • Yes, IPP and SVML (ICC only) are also good candidates, as they provide vectorized options. SVML would switch to whichever ISA you have; IPP as well, but with more potential tuning (you can remove some implementations if I remember properly). – Matthieu Brucher Dec 20 '18 at 16:37
  • `pC + 1` overlaps with `pC` and `pC+2`. Remember each vector is 2 `double`s wide, and you're using `double*` not `__m128d*`. – Peter Cordes Dec 20 '18 at 17:35
  • See also [sse_mathfun.h](http://gruntthepeon.free.fr/ssemath/). It is written with SSE intrinsics. Its origin is the scalar cephes library. – wim Dec 20 '18 at 19:09
  • @PeterCordes: true! Corrected, thanks! – markzzz Dec 21 '18 at 08:23

3 Answers

6

The simplest way is to use an exponential approximation. One possible case is based on this limit:

$e^x = \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^n$

For n = 256 = 2^8:

__m128d fastExp1(__m128d x)
{
   // (1 + x/256), then raised to the 256th power by 8 successive squarings
   __m128d ret = _mm_mul_pd(_mm_set1_pd(1.0 / 256), x);
   ret = _mm_add_pd(_mm_set1_pd(1.0), ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   return ret;
}
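
As a usage sketch (my addition, not part of the original answer): assuming the `result` register in the question's loop already holds `(mMin + clamp(...) * mRange) * ln2per12` for two samples, `fastExp1` drops in right before the store:

result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC, fastExp1(result)); // was: pC[sampleIndex] = exp(...)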

The other idea is polynomial expansion, in particular a Taylor series expansion:

$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$

__m128d fastExp2(__m128d x)
{
   const __m128d a0 = _mm_set1_pd(1.0);
   const __m128d a1 = _mm_set1_pd(1.0);
   const __m128d a2 = _mm_set1_pd(1.0 / 2);
   const __m128d a3 = _mm_set1_pd(1.0 / 2 / 3);
   const __m128d a4 = _mm_set1_pd(1.0 / 2 / 3 / 4);
   const __m128d a5 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5);
   const __m128d a6 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6);
   const __m128d a7 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6 / 7);

   __m128d ret = _mm_fmadd_pd(a7, x, a6);
   ret = _mm_fmadd_pd(ret, x, a5); 
   // If the FMA extension is not present, use
   // ret = _mm_add_pd(_mm_mul_pd(ret, x), a5);
   ret = _mm_fmadd_pd(ret, x, a4);
   ret = _mm_fmadd_pd(ret, x, a3);
   ret = _mm_fmadd_pd(ret, x, a2);
   ret = _mm_fmadd_pd(ret, x, a1);
   ret = _mm_fmadd_pd(ret, x, a0);
   return ret;
}

Note that with the same number of expansion terms, you can get a better approximation by fitting the function over the specific x range you need, for example with the least squares method.

All of these methods work only in a very limited x range, but they have continuous derivatives, which may be important in some cases.

There is a trick for approximating an exponential over a very wide range, but with noticeable piecewise linear regions. It is based on reinterpreting integers as floating-point numbers. For a more precise description, I recommend these references:

Piecewise linear approximation to exponential and logarithm

A Fast, Compact Approximation of the Exponential Function

The possible implementation of this approach:

__m128d fastExp3(__m128d x)
{
   // Scale by 1/ln(2) so the 2^x bit trick below yields e^x; the constant b
   // positions the value so the shift moves it into the exponent field
   // (the -1.05 term tunes the approximation error).
   const __m128d a = _mm_set1_pd(1.0 / M_LN2);
   const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
   __m128d t = _mm_fmadd_pd(x, a, b);  // needs FMA; with plain SSE2 use
                                       // _mm_add_pd(_mm_mul_pd(x, a), b)
   // Reinterpret as integer and shift the result into the exponent bits
   return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(t), 11));
}

Despite the simplicity and wide x range of this method, be careful when using it in math-heavy code. On small scales it gives a piecewise linear approximation, which can disrupt sensitive algorithms, especially those using differentiation.

To compare the accuracy of the different methods, look at the graphs. The first graph is made for the range x = [0..1). As you can see, the best approximation in this case is given by fastExp2(x); slightly worse but acceptable is fastExp1(x). The worst approximation is provided by fastExp3(x): the piecewise structure is noticeable and discontinuities of the first derivative are present.

[graph: approximation error of fastExp1/2/3 over x = [0..1)]

In the range x = [0..10), the fastExp3(x) method provides the best approximation; a bit worse is the approximation given by fastExp1(x): with the same number of calculations, it achieves better accuracy than fastExp2(x).

[graph: approximation error of fastExp1/2/3 over x = [0..10)]
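
Since the plots don't survive in text form, here is a minimal harness (my addition) to reproduce the comparison. It assumes the fastExp1/2/3 functions above are in scope (and FMA enabled for the ones using _mm_fmadd_pd), and reports the maximum absolute error against std::exp:

#include <cmath>
#include <cstdio>
#include <immintrin.h>

// Maximum absolute error of a two-lane SSE approximation vs. std::exp
// over [lo, hi), sampled with the given step.
template <typename F>
static double maxError(F approx, double lo, double hi, double step)
{
    double worst = 0.0;
    for (double x = lo; x + step < hi; x += 2 * step) {
        alignas(16) double out[2];
        _mm_store_pd(out, approx(_mm_set_pd(x + step, x)));
        worst = std::fmax(worst, std::fabs(out[0] - std::exp(x)));
        worst = std::fmax(worst, std::fabs(out[1] - std::exp(x + step)));
    }
    return worst;
}

int main()
{
    std::printf("fastExp1: %g\n", maxError(fastExp1, 0.0, 1.0, 1e-4));
    std::printf("fastExp2: %g\n", maxError(fastExp2, 0.0, 1.0, 1e-4));
    std::printf("fastExp3: %g\n", maxError(fastExp3, 0.0, 1.0, 1e-4));
    return 0;
}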

The next step is to improve the accuracy of the fastExp3(x) algorithm. The easiest way to significantly increase accuracy is to use the identity exp(x) = exp(x/2)/exp(-x/2). Although it increases the amount of computation, it greatly reduces the error, due to mutual error compensation in the division.

__m128d fastExp5(__m128d x)
{
   // exp(x) = exp(x/2) / exp(-x/2); each factor uses the fastExp3 bit trick
   const __m128d ap = _mm_set1_pd(0.5 / M_LN2);
   const __m128d an = _mm_set1_pd(-0.5 / M_LN2);
   const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
   __m128d tp = _mm_fmadd_pd(x, ap, b);
   __m128d tn = _mm_fmadd_pd(x, an, b);
   tp = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tp), 11));
   tn = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tn), 11));
   return _mm_div_pd(tp, tn);
}

[graph: approximation error of fastExp5]

Even greater accuracy can be achieved by combining the methods of fastExp1(x) or fastExp2(x) with the fastExp3(x) algorithm, using the identity exp(x+dx) = exp(x)*exp(dx). As shown above, the first multiplier can be computed with the fastExp3(x) approach, and for the second multiplier the fastExp1(x) or fastExp2(x) method can be used. Finding the optimal solution in this case is quite a difficult task, and I would recommend looking at the implementations in the libraries proposed in the answers.
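
For illustration, here is a minimal SSE2-only sketch of that combination (my addition, not code from the answer; `fastExpCombined` is a hypothetical name). It splits x into k*ln2 + r with k = round(x/ln2), stuffs k into the exponent field as fastExp3 does, and approximates exp(r) with a short Taylor polynomial over |r| <= ln2/2. It assumes the default round-to-nearest MXCSR mode (used by _mm_cvtpd_epi32) and does no overflow/underflow handling; the order-5 polynomial gives roughly 1e-6 relative error, so extend it if you need more:

#include <emmintrin.h>

// exp(x) = 2^k * exp(r), where k = round(x / ln2) and r = x - k*ln2,
// so |r| <= ln2/2 ~ 0.347 and a short Taylor polynomial is enough.
// Hypothetical sketch: no overflow/underflow/NaN handling.
__m128d fastExpCombined(__m128d x)
{
    const __m128d log2e = _mm_set1_pd(1.4426950408889634); // 1/ln(2)
    const __m128d ln2   = _mm_set1_pd(0.6931471805599453);

    // k = round(x * log2e); _mm_cvtpd_epi32 rounds to nearest by default
    __m128i ki = _mm_cvtpd_epi32(_mm_mul_pd(x, log2e)); // [k0, k1, 0, 0]
    __m128d kd = _mm_cvtepi32_pd(ki);

    // r = x - k*ln2 (plain mul + sub keeps this SSE2-only, no FMA needed)
    __m128d r = _mm_sub_pd(x, _mm_mul_pd(kd, ln2));

    // exp(r) ~ 1 + r + r^2/2 + r^3/6 + r^4/24 + r^5/120, Horner form
    __m128d p = _mm_set1_pd(1.0 / 120);
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0 / 24));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0 / 6));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(0.5));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0));

    // Build 2^k: bias the exponent (+1023) and move it into bits 62..52
    __m128i biased = _mm_add_epi32(ki, _mm_set1_epi32(1023));
    __m128i hi64   = _mm_unpacklo_epi32(biased, _mm_setzero_si128());
    __m128d pow2k  = _mm_castsi128_pd(_mm_slli_epi64(hi64, 52));

    return _mm_mul_pd(p, pow2k);
}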

Dmytro Dadyka
  • [Fastest Implementation of Exponential Function Using SSE](https://stackoverflow.com/q/47025373) shows how to take advantage of the `significand * 2^exponent` nature of floating point representation, using a polynomial approximation over a small range for the significand and stuffing the right integer into the exponent field. This is the "normal" way to implement an exp() function, AFAIK. It's very efficient, especially if you can use FMA instructions for the polynomial. I'd definitely recommend that first, over your `(x/256 + 1)^8` approximation. – Peter Cordes Dec 30 '18 at 02:44
  • Nice graphs, that addition is worth an upvote. Still, I would have suggested the normal efficient methods first, before suggesting the purely polynomial (over the whole range) approaches. As your last graph shows, taking advantage of the logarithmic nature of FP is essential for accuracy over a wide range of inputs, using a taylor expansion (or a custom fitted polynomial that minimizes max error) over a range like 0..1 or 1..2 or something, after range-reduction by taking out the exponent part. – Peter Cordes Dec 30 '18 at 07:34
Those more mathematical approaches are maybe fun and useful as alternative / background, or like you said for special cases where a continuous derivative is important, not *just* minimax error over some input range. `fastExp3` is brutally simple; with more steps to the approximation like in the linked Q&A, you can get much better accuracy over a wide range for similar speed cost to what the code you show is doing. – Peter Cordes Dec 30 '18 at 07:37
  • @PeterCordes, Thank you for your clarifications. I will try to understand how to extend `fastExp3(x)` with polynomials; for now it's too hard for me) – Dmytro Dadyka Dec 30 '18 at 07:42
  • Your Taylor series code can evaluate its polynomial with *only* FMAs, saving all the separate `mul` intrinsics. Evaluate starting with the highest coefficient like `c0 + x * (c1*x)` so you have a single chain of FMAs building up x^n at the same time you add coefficients. Larger example: `c0 + x * (c1 + x * (c2 + x * (c3 + x*c4)))` gives us `c0 + c1*x^1 + c2*x^2 + c3*x^3 + c4*x^4`. But beware that `_mm_fma_pd` only compiles if the FMA instruction-set is enabled. You can't use it in code that needs to work with just SSE2. – Peter Cordes Dec 30 '18 at 07:54
  • `fastExp3` actually looks pretty hacky, not the kind of thing that can be extended directly. It's shifting some significand bits into the exponent, which is weird. Look at how the answer on [Fastest Implementation of Exponential Function Using SSE](https://stackoverflow.com/q/47025373) uses `_mm_cvttps_epi32` to split the input into integer part and fractional part, and stuffs the integer part into the exponent field with `_mm_slli_epi32 (i, 23)` (for single-precision). (Adding the bias to the exponent field is done as part of adding in the significant/mantissa bits.) – Peter Cordes Dec 30 '18 at 08:02
  • @PeterCordes, thanks for Taylor improve suggestion. Done. – Dmytro Dadyka Dec 30 '18 at 09:36
  • @Dmytro Dadyka: in the last graph, you are showing `fastExp4`, right? – markzzz Jan 10 '19 at 16:13
  • @markzzz, thanks for the remark. Yes, fastExp5 on the graph corresponds to the fastExp4 function. I fixed the function name. – Dmytro Dadyka Jan 10 '19 at 18:44
  • @Dmytro Dadyka `fastExp5` seems the best for accuracy. Reading a comment below, it's written "From experience, all these are faster and more precise than a custom Padé approximation (not even talking about the unstable Taylor expansion that would give you negative numbers VERY quickly)". How stable is this function? – markzzz Jan 11 '19 at 14:48
  • @markzzz, Yes, `fastExp5` is a fast and accurate enough solution; in many cases it is optimal. Methods 1 and 2 are often useful when x is fixed in some range. As for 'stability', I think it means convergence: series expansions converge badly for large `x` (in other words, they require a large number of terms), but that does not apply to methods 3 or 5. These methods converge over the whole range of values (until the floating point overflows), and the error corresponds to a linear approximation between integer powers of two (after replacing the base by multiplying by `a`), which in practice is less than 4%. – Dmytro Dadyka Jan 11 '19 at 16:32
  • @Dmytro Dadyka: right now, my "input" range is `from -2.77259 to 2.77259` http://coliru.stacked-crooked.com/a/08fb4b551326610d - Which approximation fits better (and is stable), in your opinion? How could I generate the same plots and see the errors? – markzzz Jan 12 '19 at 13:11
  • @markzzz, what error do your calculations allow? Specify the maximum or average deviation. – Dmytro Dadyka Jan 12 '19 at 15:56
  • @DmytroDadyka I'm processing audio. I believe that preserving 8 digits (i.e. 160 dB) would be OK – markzzz Jan 13 '19 at 13:27
  • @Dmytro Dadyka: none of those approximations works, man, I believe. Tried on my range (-48 to 48, with a step of 0.1), calculating the drift on each value (compared to `exp()`): https://rextester.com/UDT46215. The best is fastExp2 with an error of `0.1230974105602360424427388352341949939727783203125`, which is too much for my purpose. – markzzz Jan 15 '19 at 15:26
  • @markzzz, yes, your requirements are very high. To reach accuracy in 8 digits will not be easy. I'm not sure that exp() provides such accuracy. – Dmytro Dadyka Jan 15 '19 at 15:45
  • @Dmytro Dadyka: and some similar exp() function? What accuracy does it get, for example? Maybe I've exaggerated with 8 digits :) But `0.123097` seems too much for me. That value will be a frequency multiplier, used to generate a waveform (i.e. `sin(freq * pitch / samplerate)`). – markzzz Jan 15 '19 at 15:53
  • You previously wrote that the range of your values is -2.77259 to 2.77259. Why are you using -48... 48 in sample? – Dmytro Dadyka Jan 15 '19 at 15:54
  • @Dmytro Dadyka: note that -48/48 is multiplied by `ln2per12` (thus, `-2.77259 to 2.77259`). – markzzz Jan 15 '19 at 16:07
  • @DmytroDadyka: for example, if I use `ippsExp_64f_I` (https://software.intel.com/en-us/ipp-dev-reference-exp) forcing it to use SSE2, I reach a max error of `3.552713678800500929355621337890625e-15` :O Quite impressive. The downside is that I can only compute "arrays", not double-packed __m128d values (which is what I need). – markzzz Jan 15 '19 at 16:14
  • It is interesting. I'll try to construct a solution based on fastExp3 but with polynomial interpolation instead of linear. – Dmytro Dadyka Jan 15 '19 at 16:32
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/186786/discussion-between-markzzz-and-dmytro-dadyka). – markzzz Jan 16 '19 at 13:06
5

There are several libraries that provide vectorized exponential, with more or less accuracy.

  • SVML, provided with the Intel compiler (it provides intrinsics as well, so if you have a licence, you can use them), has different levels of precision (and speed)
  • you mentioned IPP, also from Intel, that also provide some functionality
  • MKL also provides some interface for this computation (for this one, fixing the ISA can be done through macros, for instance if you need reproducibility or precision)
  • fmath is another option; you can tear the code out of its vectorized exp and integrate it inside your loop.

From experience, all of these are faster and more precise than a custom Padé approximation (not even talking about the unstable Taylor expansion that would give you negative numbers VERY quickly).

For SVML, IPP and MKL, I would check which one is better: calling from inside your loop or calling exp with one call for your full array (as the libraries could use AVX512 instead of just SSE2).
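
As a sketch of the whole-array option (my addition; it uses the in-place ippsExp_64f_I function from the IPP reference linked in the question's comments — check the exact signature against your IPP version):

#include <ipps.h> // Intel IPP signal-processing header

// pC already holds (mMin + clamp(a + b, 0, 1) * mRange) * ln2per12 for the
// whole block; one in-place call then applies exp() to the entire array,
// letting IPP dispatch to the best ISA available at runtime.
IppStatus st = ippsExp_64f_I(pC, blockSize);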

Matthieu Brucher
  • Addition: Mitsunari's fmath uses a lookup table and IEEE-based rounding. The lookup table means only a third-order Taylor expansion needs to be calculated, which is quick even without vector instructions. So fmath uses SIMD to vectorize the lookup-table search and the `sampleIndex` loop. – Hiroki Dec 20 '18 at 18:06
2

There is no SSE2 implementation of exp, so if you don't want to roll your own as suggested above, one option is to use AVX512 instructions on hardware that supports ERI (Exponential and Reciprocal Instructions). See https://en.wikipedia.org/wiki/AVX-512#New_instructions_in_AVX-512_exponential_and_reciprocal

I think that currently limits you to the Xeon Phi (as pointed out by Peter Cordes; I did find one claim about it being on Skylake and Cannon Lake but couldn't corroborate it), and bear in mind that the code won't run at all (i.e. will crash) on other architectures.
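
For completeness, a usage sketch (my addition): AVX-512ER provides _mm512_exp2a23_pd, a base-2 exponential accurate to roughly 23 mantissa bits, so e^x needs a pre-scale by log2(e). As the comment below notes, this only runs on Xeon Phi (KNL / KNM):

#include <immintrin.h>

// exp(x) = 2^(x * log2(e)); vexp2pd is AVX-512ER only (Xeon Phi KNL/KNM)
__m512d exp512er(__m512d x)
{
    const __m512d log2e = _mm512_set1_pd(1.4426950408889634);
    return _mm512_exp2a23_pd(_mm512_mul_pd(x, log2e));
}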

virgesmith
  • Only Xeon Phi (KNL / KNM) has AVX512ER, not SKX / CannonLake. https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 – Peter Cordes Dec 20 '18 at 17:32