Precision loss when vectorising with Intel Intrinsics

Question

I have vectorised the following single instruction stream, single data stream(SISD) code with Intel's SSE intrinsics libraries(versions 1->4.2),

  float routine(float * restrict a, float * restrict b, int size)
  {
    float sum = 0.0;
    for ( int i = 0; i < size; i++ )
    {
      sum = sum + a[i] * b[i];
    }
    return sum;
  }

into the following single instruction, multiple data(SIMD) code,

  float vectorised_routine(float * restrict a, float * restrict b, int size)
  {
    float sum= 0.0;
    __m128 vSum=  _mm_setzero_ps(); //declare and initialise our vector vars to zero
    __m128 vTemp=  _mm_setzero_ps();
    __m128 vTemp2=  _mm_setzero_ps();
    int vecOpCount= size - (size % VECTOR_LEN); //number of vector ops to be performed
    int i= 0;
    for (; i < vecOpCount; i+= VECTOR_LEN) //VECTOR_LEN= 4
    {
      vTemp= _mm_load_ps (&a[i]);
      vTemp2= _mm_load_ps (&b[i]);
      vTemp= _mm_mul_ps (vTemp2, vTemp);
      vSum= _mm_add_ps (vSum, vTemp);
    }
    int leftover= size - i; //Remaining ops. Always less then VECTOR_LEN
    for(int j= 0; j < leftover; j++)
    {
      vTemp= _mm_load_ss (&a[i+j]);
      vTemp2= _mm_load_ss (&b[i+j]);
      vTemp= _mm_mul_ps (vTemp2, vTemp);
      vSum= _mm_add_ps (vSum, vTemp);
    }
    float* storeBay= malloc(sizeof(float)*VECTOR_LEN);
    _mm_store_ps (storeBay, vSum);
    for(int j= 0; j < VECTOR_LEN; j++)
    {
      sum+= storeBay[j];
    }
    return sum;
  }

Comparing the sums returned by my SISD and SIMD code, there is no difference for the vast majority of my widely-varying sample inputs. When there is a difference, it is an integer(albeit in float form of course) and it is rarely larger then 10.

Seeing as the disparity between my sums is usually quite small, I am led to believe this could be an issue relating to precision. However, that the disparity is always an integer leads me to believe it is not a precision related issue, or if so, it is in an indirect manner.

Do you know why my routines sometimes return different values?

One thing to realize is that the order in which the products are added might not be the same. Maybe this matters? Have you considered differences in the rounding mode? — Jupiter, May 03 '20 at 21:15
Floating-point addition is not associative. Also, you're leaking memory (exactly why you decided to allocate is unclear, as is why you hand-write intrinsics for the scalar residue after the vectorized part, or why you zero-initialize locals that are always initialized by loading form memory, and which should be declared in the innermost scope instead). — EOF, May 03 '20 at 21:34
How large are your elements, how many do you add and how large is the sum, typically? An absolute error of `10` compared to a sum of `1e6` will happen almost certainly due to the different order of operations. If your sum was something like `1000`, an error of 10 would be a lot. — chtz, May 03 '20 at 21:54
Unrelated, but `malloc(sizeof(float)*VECTOR_LEN);` is insane. Use `alignas(16) float storeBay[4];` - a local 16-byte array is 100% fine and much better than leaking a dynamic allocation you forgot to free. Or just shuffle in vectors: [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764). Also, it's easier to do your scalar cleanup with plain C (. (`sum += a[i]*b[i]`;), not intrinsics, and do it at the *end*. So if your array is aligned by 16 but an odd length, you don't misalign all your vector loads. — Peter Cordes, May 04 '20 at 02:51

Precision loss when vectorising with Intel Intrinsics

0 Answers0