4

I have a function in this form (From Fastest Implementation of Exponential Function Using SSE):

__m128 FastExpSse(__m128 x)
{
    static __m128  const a   = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
    static __m128i const b   = _mm_set1_epi32(127 * (1 << 23) - 486411);
    static __m128  const m87 = _mm_set1_ps(-87);
    // fast exponential function, x should be in [-87, 87]
    __m128 mask = _mm_cmpge_ps(x, m87);

    __m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}

I want to make it C compatible.
Yet the compiler doesn't accept the form static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411); when I use C compiler.

Yet I don't want the first 3 values to be recalculated in each function call.
One solution is to inline it (But sometimes the compilers reject that).

Is there a C style to achieve it in case the function isn't inlined?

Thank You.

Royi
  • 4,153
  • 5
  • 36
  • 53

3 Answers3

4

Remove static and const.

Also remove them from the C++ version. const is OK, but static is horrible, introducing guard variables that are checked every time, and a very expensive initialization the first time.

__m128 a = _mm_set1_ps(12102203.2f); is not a function call, it's just a way to express a vector constant. No time can be saved by "doing it only once" - it normally happens zero times, with the constant vector being prepared in the data segment of the program and simply being loaded at runtime, without the junk around it that static introduces.

Check the asm to be sure, without static this is what happens: (from godbolt)

FastExpSse(float __vector(4)):
        movaps  xmm1, XMMWORD PTR .LC0[rip]
        cmpleps xmm1, xmm0
        mulps   xmm0, XMMWORD PTR .LC1[rip]
        cvtps2dq        xmm0, xmm0
        paddd   xmm0, XMMWORD PTR .LC2[rip]
        andps   xmm0, xmm1
        ret
.LC0:
        .long   3266183168
        .long   3266183168
        .long   3266183168
        .long   3266183168
.LC1:
        .long   1262004795
        .long   1262004795
        .long   1262004795
        .long   1262004795
.LC2:
        .long   1064866805
        .long   1064866805
        .long   1064866805
        .long   1064866805
harold
  • 53,069
  • 5
  • 75
  • 140
  • Are you sure that you do not confuse with the __thread keyword ? Static and const should not add any more code. It should produce the same code that you shown – benjarobin Sep 02 '18 at 18:24
  • Oh, I see, this is because you build the code with a C++ compiler, I was talking about C code. And you are right, in C++ code adding static add "complex code" – benjarobin Sep 02 '18 at 18:28
  • @benjarobin well that part was about C++. For C, marking the vector constants static is not even possible. – harold Sep 02 '18 at 18:29
  • @harold, You say that removing both in `C` will yield a code with zero overhead for that part? Is there a reason why the people answered the linked question used it? +1. – Royi Sep 02 '18 at 18:34
  • 1
    @Royi yes, no real overhead. Though if inlined, the loads of the constants may be put before a loop, that sort of thing. Anyway I don't see any static variables in the linked answers so IDK – harold Sep 02 '18 at 18:39
  • @benjarobin: `static const` *shouldn't* add anything, but unlike `static const int table[4] = {1,2,3,4};`, compilers suck and somehow fail to optimize away `_mm_set` intrinsics with compile-time-constant values. Harold is right: let the compiler handle vector constants the same way it handles string literals like when you write `char *foo = "hello";`, by possibly deduplicating across functions and whatnot. – Peter Cordes Sep 02 '18 at 18:55
  • Let's say it was inlined, is there anything to do to optimize it? For instance, as you said, if it inside a loop not to make the same operation over and over. – Royi Sep 02 '18 at 18:56
  • @Royi: Compilers know how to hoist constant-setup out of loops after inlining, even MSVC these days. – Peter Cordes Sep 02 '18 at 18:57
  • @Royi: "*why the people answered the linked question used it?*" I don't see any use of `static` variables in [Fastest Implementation of Exponential Function Using SSE](https://stackoverflow.com/q/47025373), only `const` (which is just a compile time restriction on what the function can do, and doesn't mean anything as far as where it's stored.) – Peter Cordes Sep 02 '18 at 19:00
  • @PeterCordes I was agreed with harold's answer and I learned a bit from him (thank you), sorry for the misunderstanding – benjarobin Sep 02 '18 at 19:03
4

_mm_set1_ps(-87); or any other _mm_set intrinsic is not a valid static initializer with current compilers, because it's not treated as a constant expression.

In C++, it compiles to runtime initialization of the static storage location (copying from a vector literal somewhere else). And if it's a static __m128 inside a function, there's a guard variable to protect it.

In C, it simply refuses to compile, because C doesn't support non-constant initializers / constructors. _mm_set is not like a braced initializer for the underlying GNU C native vector, like @benjarobin's answer shows.


This is really dumb, and seems to be a missed-optimization in all 4 mainstream x86 C++ compilers (gcc/clang/ICC/MSVC). Even if it somehow matters that each static const __m128 var have a distinct address, the compiler could achieve that by using initialized read-only storage instead of copying at runtime.

So it seems like constant propagation fails to go all the way to turning _mm_set into a constant initializer even when optimization is enabled.


Never use static const __m128 var = _mm_set... even in C++; it's inefficient.

Inside a function is even worse, but global scope is still bad.

Instead, avoid static. You can still use const to stop yourself from accidentally assigning something else, and to tell human readers that it's a constant. Without static, it has no effect on where/how your variable is stored. const on automatic storage just does compile-time checking that you don't modify the object.

const __m128 var = _mm_set1_ps(-87);    // not static

Compilers are good at this, and will optimize the case where multiple functions use the same vector constant, the same way they de-duplicate string literals and put them in read-only memory.

Defining constants this way inside small helper functions is fine: compilers will hoist the constant-setup out of a loop after inlining the function.

It also lets compilers optimize away the full 16 bytes of storage, and load it with vbroadcastss xmm0, dword [mem], or stuff like that.

Peter Cordes
  • 245,674
  • 35
  • 423
  • 606
2

This solution is clearly not portable, it's working with GCC 8 (only tested with this compiler):

#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>

#define INIT_M128(vFloat)  {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32)   {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}

static void print128(const void *p)
{
    unsigned char buf[16];

    memcpy(buf, p, 16);
    for (int i = 0; i < 16; ++i)
    {
        printf("%02X ", buf[i]);
    }
    printf("\n");
}

int main(void)
{
    static __m128  const glob_a = INIT_M128(12102203.2f);
    static __m128i const glob_b = INIT_M128I(127 * (1 << 23) - 486411);
    static __m128  const glob_m87 = INIT_M128(-87.0f);

    __m128  a   = _mm_set1_ps(12102203.2f); 
    __m128i b   = _mm_set1_epi32(127 * (1 << 23) - 486411);
    __m128  m87 = _mm_set1_ps(-87);

    print128(&a);
    print128(&glob_a);

    print128(&b);
    print128(&glob_b);

    print128(&m87);
    print128(&glob_m87);

    return 0;
}

As explained in the answer of @harold (in C only), the following code (build with or without WITHSTATIC) produces exactly the same code.

#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>

#define INIT_M128(vFloat)  {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32)   {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}

__m128 FastExpSse2(__m128 x)
{
#ifdef WITHSTATIC
    static __m128  const a = INIT_M128(12102203.2f);
    static __m128i const b = INIT_M128I(127 * (1 << 23) - 486411);
    static __m128  const m87 = INIT_M128(-87.0f);
#else
    __m128  a   = _mm_set1_ps(12102203.2f);
    __m128i b   = _mm_set1_epi32(127 * (1 << 23) - 486411);
    __m128  m87 = _mm_set1_ps(-87);
#endif

    __m128 mask = _mm_cmpge_ps(x, m87);

    __m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}

So in summary it's better to remove static and const keywords (better and simpler code in C++, and in C the code is portable since with my proposed hack the code is not really portable)

benjarobin
  • 4,114
  • 23
  • 21