14

There are various tricks to replace std::sqrt (Timing Square Root) and some for std::exp (Using Faster Exponential Approximation), but I can find nothing to replace std::log.

It's part of loops in my program and it's called multiple times; while exp and sqrt were optimized, Intel VTune now suggests that I optimize std::log. After that, it seems that only my design choices will be limiting.

For now I use a 3rd-order Taylor approximation of ln(1+x) with x between -0.5 and +0.5 (90% of the cases, for a maximum error of 4%) and fall back to std::log otherwise. This gave me a 15% speed-up.
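
A minimal sketch of that kind of fallback (the function name and interval test are illustrative, not the asker's actual code):

#include <cmath>

// 3rd-order Taylor series of ln(1+x) around x = 0:  x - x^2/2 + x^3/3.
// Used while the argument stays within (0.5, 1.5); otherwise fall back to std::log.
inline double approx_log(double a)
{
    double x = a - 1.0;                                   // ln(a) = ln(1 + x)
    if (x > -0.5 && x < 0.5)
        return x * (1.0 - x * (0.5 - x * (1.0 / 3.0)));   // Horner form of the Taylor polynomial
    return std::log(a);
}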

Peter Cordes
user3091460
  • Two upvotes after eight minutes for a blatantly off-topic question – Lightness Races in Orbit Oct 02 '16 at 20:35
  • Ahh yes - the accuracy Vs performance question - but without stating what accuracy would be acceptable, or what was tried I don't think you'll get an 'answer' – UKMonkey Oct 02 '16 at 20:37
  • Float precision would be good enough. I tried to start from a log2 and convert back, but very fast log2 implementations just output an int, resulting in a very poor approximation. I also tried to use the fact that ln(x) is the derivative of t -> x^t at t = 0, but that's not a good lead for computation either. – user3091460 Oct 02 '16 at 20:47
  • On modern CPUs `std::sqrt` compiles to a single instruction. It's hard to believe that you can do anything faster than that with similar accuracy. – plasmacel Oct 02 '16 at 21:00
  • @plasmacel Please have a look at the link I've put; you'll see that a couple of instructions may be way faster than a single one. Enjoy the read. – user3091460 Oct 02 '16 at 21:05
  • @user3091460 If `float` precision is sufficient, why not call `logf()` from `cmath`? Or is the issue that you need the full input domain of `double`, but the result computed only to accuracy equivalent to `float` (about 6 decimal digits)? – njuffa Oct 02 '16 at 21:05
  • @njuffa I will try it and compare what the compiler emits. – user3091460 Oct 02 '16 at 21:06
  • @user3091460 Well the calculation of the error is not correct on that site. `sqrtss` is accurate to full precision, while `rsqrtss * x` followed by a single Newton-Raphson step still doesn't give full precision. – plasmacel Oct 02 '16 at 21:11
  • @user3091460 Don't forget trying relevant compiler flags as well, in addition to -O3 maybe turning off denormal support (= turning on FTZ [flush-to-zero] mode), turning on "fast math", enabling the use of a vector math library (e.g. with the Intel compiler). – njuffa Oct 02 '16 at 21:13
  • @user3091460 It's very related: http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x – plasmacel Oct 02 '16 at 21:17
  • What makes you think your implementation's `std::log` doesn't already use the most efficient algorithm available for your system? If you're willing to sacrifice accuracy for speed (I might say something about getting wrong answers quickly), you need to say so in your question. – Keith Thompson Oct 02 '16 at 21:53
  • For now I use a 3rd-order Taylor approximation of ln(1+x) with x between -0.5 and +0.5 (90% of the cases, for a maximum error of 4%) and fall back to std::log otherwise. Gave me 15% speed-up. – user3091460 Oct 03 '16 at 08:30

5 Answers

15

Prior to embarking on the design and deployment of a customized implementation of a transcendental function for performance, it is highly advisable to pursue optimizations at the algorithmic level, as well as through the toolchain. Unfortunately, we do not have any information about the code to be optimized here, nor do we have information on the toolchain.

At the algorithmic level, check whether all calls to transcendental functions are truly necessary. Maybe there is a mathematical transformation that requires fewer function calls, or converts transcendental functions into algebraic operations. Are any of the transcendental function calls possibly redundant, e.g. because the computation is unnecessarily switching in and out of logarithmic space? If the accuracy requirements are modest, can the whole computation be performed in single precision, using float instead of double throughout? On most hardware platforms, avoiding double computation can lead to a significant performance boost.
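
Two hypothetical examples of such transformations (not taken from the question, just to illustrate the idea):

#include <cmath>

// log(a) + log(b) costs two transcendental calls; log(a * b) costs one
// (watch for overflow/underflow of the product in extreme ranges).
double log_of_product(double a, double b) { return std::log(a * b); }

// log is monotonic, so comparing logarithms is the same as comparing the
// arguments directly -- no log call needed at all.
bool log_less(double a, double b) { return a < b; }   // same result as log(a) < log(b) for a, b > 0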

Compilers tend to offer a variety of switches that affect the performance of numerically intensive code. In addition to increasing the general optimization level to -O3, there is often a way to turn off denormal support, i.e. turn on flush-to-zero, or FTZ, mode. This has performance benefits on various hardware platforms. In addition, there is often a "fast math" flag whose use results in slightly reduced accuracy and eliminates overhead for handling special cases such as NaNs and infinities, plus the handling of errno. Some compilers also support auto-vectorization of code and ship with a SIMD math library, for example the Intel compiler.
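
For example, on x86 the denormal behavior can also be controlled at run time through the MXCSR control register; a small sketch using the SSE intrinsics (the helper name is illustrative, and linking with -ffast-math typically has the same effect):

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

// Flush denormal results to zero (FTZ) and treat denormal inputs as zero (DAZ)
// for the current thread.
void enable_ftz_daz()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}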

A custom implementation of a logarithm function typically involves separating a binary floating-point argument x into an exponent e and a mantissa m, such that x = m * 2^e, therefore log(x) = log(2) * e + log(m). m is chosen such that it is close to unity, because this provides for efficient approximations, for example log(m) = log(1+f) = log1p(f) by minimax polynomial approximation.

C++ provides the frexp() function to separate a floating-point operand into mantissa and exponent, but in practice one typically uses faster machine-specific methods that manipulate floating-point data at the bit level by re-interpreting them as same-size integers. The code below for the single-precision logarithm, logf(), demonstrates both variants. Functions __int_as_float() and __float_as_int() provide for the reinterpretation of an int32_t into an IEEE-754 binary32 floating-point number and vice-versa. This code heavily relies on the fused multiply-add operation FMA supported directly in the hardware on most current processors, CPU or GPU. On platforms where fmaf() maps to software emulation, this code will be unacceptably slow.

#include <cmath>
#include <cstdint>
#include <cstring>

float __int_as_float (int32_t a) { float r; memcpy (&r, &a, sizeof r); return r;}
int32_t __float_as_int (float a) { int32_t r; memcpy (&r, &a, sizeof r); return r;}

/* compute natural logarithm, maximum error 0.85089 ulps */
float my_logf (float a)
{
    float i, m, r, s, t;
    int e;

#if PORTABLE
    m = frexpf (a, &e);
    if (m < 0.666666667f) {
        m = m + m;
        e = e - 1;
    }
    i = (float)e;
#else // PORTABLE
    i = 0.0f;
    if (a < 1.175494351e-38f){ // 0x1.0p-126
        a = a * 8388608.0f; // 0x1.0p+23
        i = -23.0f;
    }
    e = (__float_as_int (a) - __float_as_int (0.666666667f)) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    i = fmaf ((float)e, 1.19209290e-7f, i); // 0x1.0p-23
#endif // PORTABLE
    /* m in [2/3, 4/3] */
    m = m - 1.0f;
    s = m * m;
    /* Compute log1p(m) for m in [-1/3, 1/3] */
    r =             -0.130310059f;  // -0x1.0ae000p-3
    t =              0.140869141f;  //  0x1.208000p-3
    r = fmaf (r, s, -0.121483512f); // -0x1.f198b2p-4
    t = fmaf (t, s,  0.139814854f); //  0x1.1e5740p-3
    r = fmaf (r, s, -0.166846126f); // -0x1.55b36cp-3
    t = fmaf (t, s,  0.200120345f); //  0x1.99d8b2p-3
    r = fmaf (r, s, -0.249996200f); // -0x1.fffe02p-3
    r = fmaf (t, m, r);
    r = fmaf (r, m,  0.333331972f); //  0x1.5554fap-2
    r = fmaf (r, m, -0.500000000f); // -0x1.000000p-1  
    r = fmaf (r, s, m);
    r = fmaf (i,  0.693147182f, r); //  0x1.62e430p-1 // log(2)
    if (!((a > 0.0f) && (a < INFINITY))) {
        r = a + a;  // silence NaNs if necessary
        if (a  < 0.0f) r = INFINITY - INFINITY; //  NaN
        if (a == 0.0f) r = -INFINITY;
    }
    return r;
}

As noted in the code comment, the implementation above provides faithfully-rounded single-precision results, and it deals with exceptional cases consistent with the IEEE-754 floating-point standard. The performance can be increased further by eliminating special-case support, eliminating the support for denormal arguments, and reducing the accuracy. This leads to the following exemplary variant:

/* natural log on [0x1.f7a5ecp-127, 0x1.fffffep127]. Maximum relative error 9.4529e-5 */
float my_faster_logf (float a)
{
    float m, r, s, t, i, f;
    int32_t e;

    e = (__float_as_int (a) - 0x3f2aaaab) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    i = (float)e * 1.19209290e-7f; // 0x1.0p-23
    /* m in [2/3, 4/3] */
    f = m - 1.0f;
    s = f * f;
    /* Compute log1p(f) for f in [-1/3, 1/3] */
    r = fmaf (0.230836749f, f, -0.279208571f); // 0x1.d8c0f0p-3, -0x1.1de8dap-2
    t = fmaf (0.331826031f, f, -0.498910338f); // 0x1.53ca34p-2, -0x1.fee25ap-2
    r = fmaf (r, s, t);
    r = fmaf (r, s, f);
    r = fmaf (i, 0.693147182f, r); // 0x1.62e430p-1 // log(2) 
    return r;
}
njuffa
  • Thanks for that, however I can't find int_as_float and float_as_int using MSVC 15 on Win10. I found that it's part of CUDA but didn't download the full package. – user3091460 Oct 03 '16 at 08:27
  • @user3091460 These functions are *abstractions* of machine-specific functionality. As a first step, you can simply use `memcpy()`, e.g. `float __int_as_float(int32_t a) { float r; memcpy (&r, &a, sizeof(r)); return r;}` A good compiler will likely optimize this appropriately, but depending on the hardware you are targeting (which you haven't disclosed), there may be better ways, possibly involving intrinsics or inline assembly. – njuffa Oct 03 '16 at 08:59
  • @user3091460 and njuffa: the optimal asm for x86 will probably do any manipulation of floats as integers using SSE2 integer instructions, because the XMM registers are used for both scalar/vector floats and vector ints. So you should probably `_mm_set_ss(your_float)` and `_mm_castps_si128` that to get a `__m128i` that you can manipulate. (That may waste an instruction zeroing the high bits of the xmm register, [because of design limitations in intrinsics.](http://stackoverflow.com/q/39318496/224132)). A MOVD to get the float bits to/from an integer register might be good, too. – Peter Cordes Oct 03 '16 at 09:15
  • @PeterCordes Understood. I wasn't going to invest the considerable work needed for a turn-key SIMD-intrinsic solution, in particular given that it is still unclear what ISA extensions are available on the asker's hardware. Consider posting your own version using SIMD intrinsics, and I'd be happy to upvote. – njuffa Oct 03 '16 at 09:18
  • I'll just go as far as an efficient float_to_int that uses a union to type-pun, and compiles to a single `movd eax, xmm0` with clang and gcc for x86. https://godbolt.org/g/UCePpA. It's just as easy as you'd hope it would be, @user3091460 :) Manipulating the integer as a `uint32_t` might even be more efficient, since integer instructions are shorter, and on Haswell can run on port6 (which doesn't have any vector ALUs). But probably it would be better to stay in XMM registers, since you're not doing very much in the integer domain. – Peter Cordes Oct 03 '16 at 09:24
  • @PeterCordes Note that the version based on memcpy() also works fine, and avoids undefined behavior (AFAIK type punning through unions is not sanctioned by the C++ standard): https://godbolt.org/g/VXo7gj – njuffa Oct 03 '16 at 16:38
  • @njuffa: Oh right, this is a C++ question, [not C](http://stackoverflow.com/questions/11639947/is-type-punning-through-a-union-unspecified-in-c99-and-has-it-become-specified). Compilers are pretty good at optimizing away the memcpy, but I suspect that in more complex cases you might still get an actual store/reload. GNU C++ guarantees that union-based type punning works like it does in C. I think most other C++ compilers guarantee the same thing, but I guess if you get good results from the fully portable way on the target you care about, then go with that. – Peter Cordes Oct 03 '16 at 17:27
  • How did you generate the polynomials for these? – gct Jan 04 '19 at 03:00
  • @SeanMcAllister I generate minimax approximation using the [Remez algorithm](https://en.wikipedia.org/wiki/Remez_algorithm), often followed by refinement based on heuristic searches. – njuffa Jan 04 '19 at 18:36
  • What do you use to generate the approximation? Sollya? Mathematica wasn't great at it. Can you explain your trick for getting a float into a [-1/3,+1/3] range too? – gct Jan 05 '19 at 04:01
  • @SeanMcAllister I use my own software based on Remez. The Sollya tool should work well (I haven't used it myself but worked with a colleague who did). `0x3f2aaaab` is the binary representation of 2/3 in IEEE-754 single precision (I probably should have added a comment there), so you can modify this to reduce to other ranges by adjusting the constant. The rest is straightforward bit manipulation to separate exponent and mantissa (`0xff800000` covers sign and exponent fields). – njuffa Jan 05 '19 at 04:09
  • @SeanMcAllister So `0x3f2aaaab` is used to reduce to [2/3, 4/3]; it is the lower bound of the interval. Other codes you may encounter will reduce to [sqrt(2)/2, sqrt(2)] instead, in which case one would use `0x3f3504f3`. I have also experimented with reduction to [3/4, 3/2], in which case the constant is `0x3f400000` (constants with only few bits are more efficient on some architectures). – njuffa Jan 05 '19 at 04:31
  • @njuffa What could be the reason why I get my_logf()*0.4342944819f 2x faster than std::log10(), but std::log10() over 42x faster than my_faster_logf()*0.4342944819f (Ubuntu / latest GCC in use)? My CPU is an old i5-750. Also, if I input only values in the range [0,1] to do a linear-to-dB conversion, could your functions be even faster... and what would have to change to achieve a sufficient accuracy of 0.001 dB in the conversion? – Juha P Mar 26 '21 at 13:38
  • @JuhaP Sorry, I don't understand what you are asking about. This is a Q&A site, not a discussion forum. Consider asking a site-appropriate question if a search for potential duplicates on this site comes up empty. – njuffa Mar 26 '21 at 15:28
  • Thanks for the quick reply. OK, forget the latter question; my main question was why your two logf() functions have such a huge difference in performance (in your text the simplified function my_faster_logf() should increase performance but, for me, it's a hundred times slower than your my_logf() function is)? – Juha P Mar 26 '21 at 16:44
  • @JuhaP If you compile with the same compiler for the same platform using the same compilation switches, `my_faster_logf()` should be about twice the speed of `my_logf()`. If your platform does not support fused multiply-add (FMA) operations in hardware, calls to `fmaf()` will be very slow, since the functionality will need to be emulated. – njuffa Mar 26 '21 at 17:30
  • @njuffa That also was my thought but, in practice, it isn't the situation. I use exact same switches, even compare against each other (test I'm using results value of 421.346 for my_faster_logf() against 4.17925 for my_logf() (std::log10() gets result of ~9.7) so, my_faster_logf() does not work well (at least in performance testing situation). Here's link for the test routines I'm using: https://www.kvraudio.com/forum/viewtopic.php?p=7170633#p7170633 – Juha P Mar 26 '21 at 18:03
  • @JuhaP Core i5-750 (Lynnfield, ca. 2010) does **not** have FMA in hardware. So I would expect `fmaf()` to be very slow on this processor. The FMA operation was added with Haswell in 2013. Sorry, I am not going to debug for you. You may need to look at the generated code. I just confirmed, again (different compiler and different CPU than when I originally posted this code), that `my_faster_logf()` runs at about twice the speed of `my_logf()` [`PORTABLE = 0`] . – njuffa Mar 26 '21 at 18:15
  • OK, thanks. I'll try to debug the generated code. – Juha P Mar 26 '21 at 19:16
  • @JuhaP For x86-64 platforms, I see around 20 instructions generated for `my_faster_logf()` versus about 45 instructions for the fastpath of `my_logf()`. – njuffa Mar 26 '21 at 19:20
  • @njuffa OK, now that you took all my_logf() functionality out of the "if ((a > 0.0f) && (a <= 3.40282347e+38f)) { // 0x1.fffffep+127 } else { // silence NaNs }" structure, this approximation supports emulated mode. Now I get 1023.46 for the my_logf() approximation ... which I believe is OK result compared to 421.346 for my_faster_logf() ... and this also proves that fmaf() is emulated here. – Juha P Mar 26 '21 at 21:03
  • @JuhaP The code update today merely moved the special case handling to the end, and the generated assembly code differs *minimally* from that generated for the previous version of the code. On my machine here (Intel Xeon W 2133; Skylake-W) I see approximate throughput of one every 2.3ns for `my_faster_logf()` and one every 5.8ns for `my_logf()`, and MSVC compiler's built-in `logf()` at one every 4.8ns. For fast `logf` on [0,1), I would suggest unconditionally multiplying the input of `my_faster_logf()` by `16777216` (= 2**24) and subtracting `16.63553233f` from its result. – njuffa Mar 26 '21 at 21:29
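
A sketch of the scaling trick described in the last comment above, for inputs known to lie in [0, 1) (the wrapper name is mine, not from the answer):

// Pre-scale by 2^24 so the argument stays clear of the denormal range,
// then subtract 24*log(2), which is approximately 16.63553233.
float my_faster_logf_01 (float a)
{
    return my_faster_logf (a * 16777216.0f) - 16.63553233f; // 16777216 = 0x1.0p+24
}
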
4

Take a look at this discussion; the accepted answer refers to an implementation of a function for computing logarithms based on the Zeckendorf decomposition.

In the comments in the implementation file there is a discussion about complexity and some tricks to reach O(1).

Hope this helps!

Mo Abdul-Hameed
1
#include <math.h>
#include <iostream>

constexpr int LogPrecisionLevel = 14;
constexpr int LogTableSize = 1 << LogPrecisionLevel;

double log_table[LogTableSize];

void init_log_table() {
    for (int i = 0; i < LogTableSize; i++) {
        log_table[i] = log2(1 + (double)i / LogTableSize);
    }
}

double fast_log2(double x) { // x>0
    long long t = *(long long*)&x;
    int exp = (t >> 52) - 0x3ff;
    int mantissa = (t >> (52 - LogPrecisionLevel)) & (LogTableSize - 1);
    return exp + log_table[mantissa];
}

int main() {
    init_log_table();

    double d1 = log2(100); //6.6438561897747244
    double d2 = fast_log2(100); //6.6438561897747244
    double d3 = log2(0.01); //-6.6438561897747244
    double d4 = fast_log2(0.01); //-6.6438919626096089
}
Saar Ibuki
  • Your type-punning violates strict-aliasing. (Use `memcpy` instead of pointer-casting. Also you should probably use `unsigned long long` because you don't need an arithmetic shift. Doesn't matter for correctness on a 2's complement machine, but still.) This also requires that integer endian matches float endian, like on x86, so you should at least document that. – Peter Cordes Sep 04 '18 at 19:20
  • Some text to explain the table-lookup strategy and actual relative / absolute precision over some range of inputs, and limitations like what happens for 0 or negative inputs, would be a good idea. – Peter Cordes Sep 04 '18 at 19:21
  • Your table only needs to be `float`. That would cut your cache footprint in half. (But a 2^14 * 4-byte table is still 64kiB. You will get lots of cache misses in most use-cases, which is why most fast-log implementations use a polynomial approximation on modern CPUs, not a table lookup. Especially when you can use SIMD: [Efficient implementation of log2(\_\_m256d) in AVX2](https://stackoverflow.com/q/45770089)) – Peter Cordes Sep 04 '18 at 19:30
  • Peter, sorry for commenting on a very old answer, but is the strict-aliasing rule really being broken here? I guess you're referring to the very first line in the fast_log2 function. I'd assume there's really no aliasing here and that the value of x is being copied, reinterpreted as a long long (so very similar behavior to memcpy). Unless I'm missing something, t is not an alias, right? – Asher May 26 '21 at 22:16
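
A sketch of the same lookup written with memcpy, as Peter Cordes suggests, to avoid the strict-aliasing issue (it reuses log_table, LogTableSize and LogPrecisionLevel from the answer above; assumes IEEE-754 doubles with integer endianness matching float endianness, e.g. x86; not part of the original answer):

#include <cstdint>
#include <cstring>

double fast_log2_punned(double x) { // x > 0, finite, normalized
    std::uint64_t t;
    std::memcpy(&t, &x, sizeof t);                    // well-defined type punning
    int exp = static_cast<int>(t >> 52) - 0x3ff;      // unbiased binary exponent
    int mantissa = static_cast<int>((t >> (52 - LogPrecisionLevel)) & (LogTableSize - 1));
    return exp + log_table[mantissa];
}
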
0

I vectorized @njuffa's answer. Natural log, works with AVX2:

#include <immintrin.h>

inline __m256 mm256_fmaf(__m256 a, __m256 b, __m256 c){
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

//https://stackoverflow.com/a/39822314/9007125
//https://stackoverflow.com/a/65537754/9007125
// vectorized version of the answer by njuffa
/* natural log on [0x1.f7a5ecp-127, 0x1.fffffep127]. Maximum relative error 9.4529e-5 */
inline __m256 fast_log_sse(__m256 a){

    __m256i aInt = _mm256_castps_si256 (a);   // reinterpret the float bits as integers
    __m256i e =    _mm256_sub_epi32( aInt,  _mm256_set1_epi32(0x3f2aaaab));
            e =    _mm256_and_si256( e,  _mm256_set1_epi32(0xff800000) );

    __m256i subtr =  _mm256_sub_epi32(aInt, e);
    __m256 m =  _mm256_castsi256_ps (subtr);

    __m256 i =  _mm256_mul_ps( _mm256_cvtepi32_ps(e), _mm256_set1_ps(1.19209290e-7f));// 0x1.0p-23
    /* m in [2/3, 4/3] */
    __m256 f =  _mm256_sub_ps( m,  _mm256_set1_ps(1.0f) );
    __m256 s =  _mm256_mul_ps(f, f); 
    /* Compute log1p(f) for f in [-1/3, 1/3] */
    __m256 r =  mm256_fmaf( _mm256_set1_ps(0.230836749f),  f,  _mm256_set1_ps(-0.279208571f) );// 0x1.d8c0f0p-3, -0x1.1de8dap-2
    __m256 t =  mm256_fmaf( _mm256_set1_ps(0.331826031f),  f,  _mm256_set1_ps(-0.498910338f) );// 0x1.53ca34p-2, -0x1.fee25ap-2

           r =  mm256_fmaf(r, s, t);
           r =  mm256_fmaf(r, s, f);
           r =  mm256_fmaf(i, _mm256_set1_ps(0.693147182f),  r);  // 0x1.62e430p-1 // log(2)
    return r;
}
Kari
  • Note that your `mm256_fmaf` can compile as separate mul and add operations, with rounding of the intermediate product. It's *not* guaranteed to be an FMA. (And only some compilers, like GCC, will "contract" it into an FMA instruction for you, when the target supports FMA like most AVX2 machines do (not quite all: one VIA design). Probably best to just target AVX2+FMA3 and use `_mm256_fmadd_ps`, maybe with an optional fallback if you want, but not a misleadingly named and maybe slower `fma` function by default. – Peter Cordes Jan 03 '21 at 13:30
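
On targets that do have FMA3 (nearly all AVX2 CPUs), the helper can simply wrap the dedicated intrinsic, as Peter Cordes suggests; a drop-in sketch (compile with -mfma or /arch:AVX2):

#include <immintrin.h>

// Guaranteed fused multiply-add: one instruction, no intermediate rounding.
inline __m256 mm256_fmadd(__m256 a, __m256 b, __m256 c){
    return _mm256_fmadd_ps(a, b, c);
}
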
-3

It depends on how accurate you need to be. Often log is called to get an idea of the magnitude of the number, which you can do essentially for free by examining the exponent field of a floating-point number. That's also your first approximation. I'll put in a plug for my book "Basic Algorithms", which explains how to implement the standard library math functions from first principles.
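
A sketch of that "free" first approximation (not code from the answer itself): for a normalized positive double, the unbiased exponent field is floor(log2(x)).

#include <cstdint>
#include <cstring>

// Extract the biased exponent field and remove the bias (1023 for binary64).
// Result equals floor(log2(x)) for normalized x > 0; multiply by ln(2) ~ 0.693147
// for a crude natural-log estimate.
inline int approx_log2_floor(double x)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return static_cast<int>((bits >> 52) & 0x7ff) - 1023;
}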

Malcolm McLean