I have vectorised the following single instruction stream, single data stream(SISD) code with Intel's SSE intrinsics libraries(versions 1->4.2),
float routine(float * restrict a, float * restrict b, int size)
{
float sum = 0.0;
for ( int i = 0; i < size; i++ )
{
sum = sum + a[i] * b[i];
}
return sum;
}
into the following single instruction, multiple data(SIMD) code,
float vectorised_routine(float * restrict a, float * restrict b, int size)
{
float sum= 0.0;
__m128 vSum= _mm_setzero_ps(); //declare and initialise our vector vars to zero
__m128 vTemp= _mm_setzero_ps();
__m128 vTemp2= _mm_setzero_ps();
int vecOpCount= size - (size % VECTOR_LEN); //number of vector ops to be performed
int i= 0;
for (; i < vecOpCount; i+= VECTOR_LEN) //VECTOR_LEN= 4
{
vTemp= _mm_load_ps (&a[i]);
vTemp2= _mm_load_ps (&b[i]);
vTemp= _mm_mul_ps (vTemp2, vTemp);
vSum= _mm_add_ps (vSum, vTemp);
}
int leftover= size - i; //Remaining ops. Always less then VECTOR_LEN
for(int j= 0; j < leftover; j++)
{
vTemp= _mm_load_ss (&a[i+j]);
vTemp2= _mm_load_ss (&b[i+j]);
vTemp= _mm_mul_ps (vTemp2, vTemp);
vSum= _mm_add_ps (vSum, vTemp);
}
float* storeBay= malloc(sizeof(float)*VECTOR_LEN);
_mm_store_ps (storeBay, vSum);
for(int j= 0; j < VECTOR_LEN; j++)
{
sum+= storeBay[j];
}
return sum;
}
Comparing the sums returned by my SISD and SIMD code, there is no difference for the vast majority of my widely-varying sample inputs. When there is a difference, it is an integer(albeit in float form of course) and it is rarely larger then 10.
Seeing as the disparity between my sums is usually quite small, I am led to believe this could be an issue relating to precision. However, that the disparity is always an integer leads me to believe it is not a precision related issue, or if so, it is in an indirect manner.
Do you know why my routines sometimes return different values?