
The task is to set each element of a C integer array to its absolute value. I'm trying to do it as efficiently as possible. Below is a progression of optimizations that I've made. Please tell me whether these are actually optimizations at all, and whether more can be made!

The first parameter of the function will be an integer array, and the second will be an integer size of that array.

Here's the standard implementation:

void absolute (int array[], int n){
  for(int i = 0; i < n; i++)
    if(array[i] < 0)
      array[i] = - array[i];
}

That's plenty to satisfy any introductory programming course professor, but I'd like to play with it a little more and maybe learn something about optimization along the way.

Based on https://stackoverflow.com/a/2074403, a branchless absolute value:

void absolute (int array[], int n){
  for(int i = 0; i < n; i++){
    uint32_t temp = array[i] >> 31;     // make a mask of the sign bit
    array[i] ^= temp;                   // toggle the bits if value is negative
    array[i] += temp & 1;               // add one if value was negative
  }
}

Based on the idea that comparing against zero is cheaper, and that the loop doesn't need an extra index variable sitting around:

void absolute (int array[], int n){
  for(n--; n >= 0;){
    uint32_t temp = array[n] >> 31;
    array[n] ^= temp;
    array[n] += temp & 1;
  }
}

(does this still vectorize, though?)

That's as far as I got. Can more be done to optimize this function?

oink

2 Answers


Personally I rather like this question. It's questions like these that get me wondering whether there's a way to make my own code better.

Your final optimization is incorrect: it decrements n once in the initializer, but n is never decremented again inside the loop, so the loop never terminates. To correct it you need for(n--; n >= 0; n--), as sketched below. That said, in my tests counting down instead of up showed no noticeable advantage.
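
For reference, here is the corrected count-down version of the function from the question (as in the question, this needs <stdint.h> for uint32_t):

void absolute(int array[], int n)
{
    for (n--; n >= 0; n--)                   // decrement n each iteration, stop after index 0
    {
        uint32_t temp = array[n] >> 31;      // mask of the sign bit
        array[n] ^= temp;                    // flip the bits if the value is negative
        array[n] += temp & 1;                // add one if the value was negative
    }
}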

I also found that if the signs of the array values are not evenly distributed (i.e. negative values are rare), the simple if(array[i] < 0) used in the first implementation is actually significantly faster, because the branch becomes highly predictable.

Here's the code I used to benchmark:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <stdint.h>
#ifdef _OPT3
#include <emmintrin.h>
#include <tmmintrin.h>
#endif

int main(int argc, char **argv)
{
        int *array;
        struct timespec tsstart, tsend;
        int ncount = 500000000;
        int i;

        array = malloc(sizeof(int) * ncount);

        for(i = 0; i < ncount; i++)
        {
                array[i] = rand();
#ifdef _DIST
                if(rand() % 100 == 0) // make the values less likely to be negative.
#else
                if(rand() % 2 == 0) // the values are equally likely to be negative as positive.
#endif
                        array[i] = -rand();
        }

        clock_gettime(CLOCK_MONOTONIC, &tsstart);

#ifdef _OPT1
        for(i = 0; i < ncount; i++)
        {
                uint32_t ntemp = array[i] >> 31;
                array[i] ^= ntemp;
                array[i] += ntemp & 1;
        }
#elif _OPT2
        for(ncount--; ncount >= 0; ncount--)
        {
                uint32_t ntemp = array[ncount] >> 31;
                array[ncount] ^= ntemp;
                array[ncount] += ntemp & 1;
        }
#elif _OPT3
        for(i = 0; i < ncount; i+=4)
        {
                __m128i a3_a2_a1_a0 = _mm_loadu_si128((__m128i*)&array[i]);         //Load 4 int32 elements from array.
                a3_a2_a1_a0 = _mm_abs_epi32(a3_a2_a1_a0);                           //Set absolute of 4 int32 elements in single instruction.
                _mm_storeu_si128((__m128i*)(&array[i]), a3_a2_a1_a0);               //Store 4 int32 elements of array.
        }
#elif _OPT4
        for(i = 0; i < ncount; i++)
        {
                array[i] = abs(array[i]); // abs() is actually an intrinsic on gcc and msvc
        }
#else
        for(i = 0; i < ncount; i++)
        {
                if(array[i] < 0)
                {
                        array[i] = -array[i];
                }
        }
#endif

        clock_gettime(CLOCK_MONOTONIC, &tsend);

        printf("start: %ld.%09ld\n", tsstart.tv_sec, tsstart.tv_nsec);
        printf("end: %ld.%09ld\n", tsend.tv_sec, tsend.tv_nsec);

        tsend.tv_sec -= tsstart.tv_sec;
        tsend.tv_nsec -= tsstart.tv_nsec;
        if(tsend.tv_nsec < 0)
        {
                tsend.tv_sec--;
                tsend.tv_nsec = 1000000000 + tsend.tv_nsec;
        }
        printf("diff: %ld.%09ld\n", tsend.tv_sec, tsend.tv_nsec);

        free(array);

        return 0;
}

Test Results

Here are my results (the times are in seconds). These tests were run on an Intel(R) Xeon(R) CPU W3580 @ 3.33GHz with gcc (Debian 4.9.2-10) 4.9.2.

// Implementation One (No Optimizations)
$ gcc -O3 -march=native test.c
$ ./a.out
start: 9221396.418007954
end: 9221398.103490309
diff: 1.685482355

// Implementation One, Non-Random Distribution
$ gcc -D_DIST -O3 -march=native test.c
$ ./a.out
start: 9221515.889463124
end: 9221516.255742919
diff: 0.366279795

// Implementation Two (Branchless)
$ gcc -D_OPT1 -O3 -march=native test.c
$ ./a.out
start: 9221472.539690988
end: 9221472.787347636
diff: 0.247656648

// Implementation Three (Branchless Decrement)
$ gcc -D_OPT2 -O3 -march=native test.c
$ ./a.out
start: 9221930.068693139
end: 9221930.334575475
diff: 0.265882336

// Rotem's Implementation (SIMD)
$ gcc -D_OPT3 -O3 -march=native test.c
$ ./a.out
start: 9222076.001094679
end: 9222076.230432423
diff: 0.229337744

// Intuitive abs() Implementation
$ gcc -D_OPT4 -O3 -march=native test.c
$ ./a.out
start: 9222112.523690484
end: 9222112.754820240
diff: 0.231129756

// Intuitive abs() Implementation Without -march=native
$ gcc -D_OPT4 -O3 test.c
$ ./a.out
start: 9223301.744006196
end: 9223301.974097927
diff: 0.230091731

Conclusion

My takeaway from this is that hardware branch prediction can speed up code execution more than any of these software-level optimizations. By optimizing out the branching, you've created code that executes the same steps no matter what data is being processed. So while it executes in constant time, if the data is not perfectly randomly distributed you might actually be making the execution slower.

Update: I did some tests with compiler optimizations turned on and found different results that don't entirely support the conclusions I came to earlier.

In my experience I have found that simply writing less code is normally the best optimization: the fewer instructions, the faster it executes, regardless of hardware features.

I look forward to reading any comments on this exercise.

Update

I've added the results of Rotem's implementation. This code is super fast and shows that the fewer instructions you have, the faster the execution time. Nice work Rotem!

Update 2

I did some extensive testing today and found that micro-optimizations like changing the way the for loop counts have absolutely no effect once compiler optimizations like gcc -O3 are turned on. The compiler ends up generating assembly that compares a pointer into the array against its end to test whether we've reached it.
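
Roughly speaking, the generated loop is equivalent to this pointer-based form (a C sketch of the idea, not the compiler's actual output):

void absolute(int array[], int n)
{
    int *p = array;
    int *end = array + n;          // one past the last element

    for (; p != end; p++)          // pointer comparison instead of an index counter
    {
        if (*p < 0)
            *p = -*p;
    }
}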

Hand-tuning the SSE code provided by Rotem also makes no difference when compiled with gcc -O3, as the compiler aligns the memory on a 16-byte boundary, which removes the need for the unaligned _mm_loadu_si128()/_mm_storeu_si128() variants.
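
For comparison, the aligned variants would look like this (a sketch only; it assumes array is 16-byte aligned, e.g. allocated with C11 aligned_alloc, and that n is a multiple of 4 -- the name absolute_aligned is just for illustration):

#include <emmintrin.h>   // _mm_load_si128 / _mm_store_si128
#include <tmmintrin.h>   // _mm_abs_epi32 (SSSE3)

void absolute_aligned(int array[], int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128i v = _mm_load_si128((__m128i*)&array[i]);    // aligned load of 4 int32 elements
        v = _mm_abs_epi32(v);                                // absolute value of all 4 lanes
        _mm_store_si128((__m128i*)&array[i], v);             // aligned store back to the array
    }
}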

Final Update

I added another implementation which is using the simple and intuitive abs() function. It turns out abs() on gcc and MSVC is actually a compiler intrinsic. I redid all the test results simply using gcc -O3 optimizations.

As you can see, Rotem's SIMD implementation and the abs() implementation are the fastest, followed by the two XOR implementations, and finally the branching implementations.

Of the two XOR implementations, the one that decrements the for loop is actually slightly slower, as its loop contains 14 instructions whereas the increment loop contains only 13.

Rotem's SIMD implementation and the abs() implementation both rely on the PABSD instruction, and both have loops containing 7 instructions. The slight difference in speed (SIMD being slightly faster) comes from the fact that the optimized SIMD implementation assumes the memory always contains a multiple of 4 integers (128 bits), whereas the abs() implementation requires extra instructions to handle the case where it does not; that tail handling is sketched below.
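
Something along these lines (a sketch of the main-loop-plus-tail pattern, not the compiler's actual code; the name absolute_any_n is just for illustration):

#include <stdlib.h>      // abs()
#include <tmmintrin.h>   // _mm_abs_epi32 (SSSE3)

void absolute_any_n(int array[], int n)
{
    int i = 0;

    for (; i + 4 <= n; i += 4)                               // process 4 elements per iteration
    {
        __m128i v = _mm_loadu_si128((__m128i*)&array[i]);    // unaligned load of 4 int32 elements
        v = _mm_abs_epi32(v);                                // PABSD on all 4 lanes
        _mm_storeu_si128((__m128i*)&array[i], v);            // store them back
    }

    for (; i < n; i++)                                       // scalar tail for the last 0-3 elements
        array[i] = abs(array[i]);
}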

What's amazing here is that simply using abs() achieves almost exactly the same speed as the SIMD version, with the simplicity of calling a C library function. The abs() loop built without -march=native is only 4 instructions longer; instead of utilizing PABSD, it uses the PSRAD, PXOR, and PSUBD instructions.

Why is portable abs() faster than the XOR implementation?

It turns out the portable (or non-native) abs() assembly is nearly identical to that of the XOR implementation.

Here's the abs():

psrad   $31, %xmm0
pxor    %xmm0, %xmm1
psubd   %xmm0, %xmm1

Here's the XOR:

psrad   $31, %xmm1
movdqa  %xmm1, %xmm2
pxor    %xmm1, %xmm0
pand    %xmm3, %xmm2
paddd   %xmm2, %xmm0

Now let's convert these back into C code:

Here's the abs():

int ntemp = array[i] >> 31;
array[i] ^= ntemp;
array[i] -= ntemp;

Here's the XOR:

uint32_t ntemp = array[i] >> 31;
array[i] ^= ntemp;
array[i] += ntemp & 1;

The difference is that the original XOR implementation performs an extra bitwise AND operation.

Final Conclusion

Use abs()! :D
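
That is, the whole function can be written as simply as this (the same loop as the abs() case in the benchmark, in the original function's signature):

#include <stdlib.h>   // abs() is a compiler intrinsic on gcc and MSVC

void absolute(int array[], int n)
{
    for (int i = 0; i < n; i++)
        array[i] = abs(array[i]);
}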

Brandon

For best performance I suggest using SIMD instructions. Different processors support different sets of SIMD instructions.

The common way to apply manual SIMD optimization is through C intrinsic functions.

The following example uses SSSE3 intrinsics (_mm_abs_epi32):

#include <tmmintrin.h>   //SSSE3 intrinsics (_mm_abs_epi32); <intrin.h> on MSVC.

//Limitations:
//1. n must be a multiple of 4.
void absolute(int array[], int n)
{
    int x;

    //Process 4 elements per iteration.
    for (x = 0; x < n; x += 4)
    {
        __m128i a3_a2_a1_a0 = _mm_loadu_si128((__m128i*)&array[x]);     //Load 4 int32 elements from array.

        a3_a2_a1_a0 = _mm_abs_epi32(a3_a2_a1_a0);                       //Set absolute of 4 int32 elements in single instruction.

        _mm_storeu_si128((__m128i*)(&array[x]), a3_a2_a1_a0);           //Store 4 int32 elements of array.
    }
}

Note that this is just an example, not necessarily the best possible performance.
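
For example, a minimal caller might look like this (a usage sketch; note the length is a multiple of 4, per the limitation above):

#include <stdio.h>

int main(void)
{
    int values[8] = { -3, 7, -1, 0, 42, -42, 5, -8 };

    absolute(values, 8);                 //n = 8 is a multiple of 4

    for (int i = 0; i < 8; i++)
        printf("%d ", values[i]);        //prints: 3 7 1 0 42 42 5 8
    printf("\n");

    return 0;
}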

Thank you Brandon for measuring my code sample.

Rotem