I am trying to program the matrix multiplication in C using simd intrinsics. I was pretty sure of my implementation, but when I execute, i get some numerical errors starting from the 5th digit of the resulting matrix's coefficients.
REAL_T is just a float with typedef
/* This is my matmul Version with simd, using floating simple precision*/
void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){
int i,j,k;
__m256 vA, vB, vC, vRes;
for (i=0; i<n; i++){
for (j=0; j<n; j++){
for (k=0; k<n; k= k+8){
vA = _mm256_load_ps(&A[i*n+k]);
vB = _mm256_loadu_ps(&B[k*n+j]);
vC = _mm256_mul_ps(vA, vB);
vC = _mm256_hadd_ps(vC, vC);
vC = _mm256_hadd_ps(vC, vC);
/*To get the resulting coefficient, after doing 2 hadds,
I have to get the first and the last element of the resulting
Vector vC*/
C[i*n+j] += ((float )(vC[0])) + ((float )(vC[7]));
} /* for k */
} /* for j */
} /* for i */
}
*/End of program
/*And this is the sequential Version*/
void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){
int i,j,k;
for (i=0; i<n; i++){
for (j=0; j<n; j++){
for (k=0; k<n; k++){
C[i*n+j] += A[i*n+k] * B[k*n+j];
} /* for k */
} /* for j */
} /* for i */
}
/*End of program*/
/*The matrix are initialized as follows*/
for (i = 0; i < n; i++)
for (j = 0; j < n; j++){
*(A+i*n+j) = 1 / ((REAL_T) (i+j+1));
*(B+i*n+j) = 1.0;
*(C+i*n+j) = 1.0;
}
/*End of initialization*/
The tested matrix are of size 512*512. For the sequential version, the top left square of the resulting matrix gives:
+6.916512e+01 +6.916512e+01
+5.918460e+01 +5.918460e+01
+7.946186e+00 +7.946186e+00
+7.936391e+00 +7.936391e+00
However, for the simd version, the square is:
+6.916510e+01 +6.916510e+01
+5.918463e+01 +5.918463e+01
+7.946147e+00 +7.946147e+00
+7.936355e+00 +7.936355e+00
There is as shown a numerical error between the 2 versions. Any help would be really appreciated !