0

I have vectorised the following single instruction stream, single data stream(SISD) code with Intel's SSE intrinsics libraries(versions 1->4.2),

  float routine(float * restrict a, float * restrict b, int size)
  {
    float sum = 0.0;
    for ( int i = 0; i < size; i++ )
    {
      sum = sum + a[i] * b[i];
    }
    return sum;
  }

into the following single instruction, multiple data(SIMD) code,

  float vectorised_routine(float * restrict a, float * restrict b, int size)
  {
    float sum= 0.0;
    __m128 vSum=  _mm_setzero_ps(); //declare and initialise our vector vars to zero
    __m128 vTemp=  _mm_setzero_ps();
    __m128 vTemp2=  _mm_setzero_ps();
    int vecOpCount= size - (size % VECTOR_LEN); //number of vector ops to be performed
    int i= 0;
    for (; i < vecOpCount; i+= VECTOR_LEN) //VECTOR_LEN= 4
    {
      vTemp= _mm_load_ps (&a[i]);
      vTemp2= _mm_load_ps (&b[i]);
      vTemp= _mm_mul_ps (vTemp2, vTemp);
      vSum= _mm_add_ps (vSum, vTemp);
    }
    int leftover= size - i; //Remaining ops. Always less then VECTOR_LEN
    for(int j= 0; j < leftover; j++)
    {
      vTemp= _mm_load_ss (&a[i+j]);
      vTemp2= _mm_load_ss (&b[i+j]);
      vTemp= _mm_mul_ps (vTemp2, vTemp);
      vSum= _mm_add_ps (vSum, vTemp);
    }
    float* storeBay= malloc(sizeof(float)*VECTOR_LEN);
    _mm_store_ps (storeBay, vSum);
    for(int j= 0; j < VECTOR_LEN; j++)
    {
      sum+= storeBay[j];
    }
    return sum;
  }

Comparing the sums returned by my SISD and SIMD code, there is no difference for the vast majority of my widely-varying sample inputs. When there is a difference, it is an integer(albeit in float form of course) and it is rarely larger then 10.

Seeing as the disparity between my sums is usually quite small, I am led to believe this could be an issue relating to precision. However, that the disparity is always an integer leads me to believe it is not a precision related issue, or if so, it is in an indirect manner.

Do you know why my routines sometimes return different values?

  • One thing to realize is that the order in which the products are added might not be the same. Maybe this matters? Have you considered differences in the rounding mode? – Jupiter May 03 '20 at 21:15
  • 2
    Floating-point addition is not associative. Also, you're leaking memory (exactly why you decided to allocate is unclear, as is why you hand-write intrinsics for the scalar residue after the vectorized part, or why you zero-initialize locals that are always initialized by loading form memory, and which should be declared in the innermost scope instead). – EOF May 03 '20 at 21:34
  • 1
    How large are your elements, how many do you add and how large is the sum, typically? An absolute error of `10` compared to a sum of `1e6` will happen almost certainly due to the different order of operations. If your sum was something like `1000`, an error of 10 would be a lot. – chtz May 03 '20 at 21:54
  • Unrelated, but `malloc(sizeof(float)*VECTOR_LEN);` is insane. Use `alignas(16) float storeBay[4];` - a local 16-byte array is 100% fine and much better than leaking a dynamic allocation you forgot to free. Or just shuffle in vectors: [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764). Also, it's easier to do your scalar cleanup with plain C (. (`sum += a[i]*b[i]`;), not intrinsics, and do it at the *end*. So if your array is aligned by 16 but an odd length, you don't misalign all your vector loads. – Peter Cordes May 04 '20 at 02:51

0 Answers0