The dimensions (shape) of the matrix have an impact on cache performance.
When we do a row-wise scan of a 10000 x 3 integer matrix A, we can still cache effectively. For simplicity of illustration, I assume that each column of A is aligned to a cache-line boundary.
--------------------------------------
A[1, 1] A[1, 2] A[1, 3] M M M
A[2, 1] A[2, 2] A[2, 3] H H H
. . . . . .
. . . . . .
A[16,1] A[16,2] A[16,3] H H H
--------------------------------------
A[17,1] A[17,2] A[17,3] M M M
A[18,1] A[18,2] A[18,3] H H H
. . . . . .
. . . . . .
A[32,1] A[32,2] A[32,3] H H H
--------------------------------------
A[33,1] A[33,2] A[33,3] M M M
A[34,1] A[34,2] A[34,3] H H H
. . . . . .
. . . . . .
A 64-byte cache line can hold 16 (4-byte) integers. When we access A[1, 1], a full cache line is filled, that is, A[1, 1] to A[16, 1] are all loaded into cache. When we scan a row A[1, 1], A[1, 2], A[1, 3], a 16 x 3 block is loaded into cache, which is far smaller than the cache capacity (32 KB). While we have a cache miss (M) for each element in the 1st row, when we start to scan the 2nd row, we have a cache hit (H) for every element. So we have a periodic pattern:
[3 Misses] -> [45 Hits] -> [3 Misses] -> [45 Hits] -> ...
That is, we have on average a cache miss ratio of 3 / 48 = 1 / 16 = 6.25%. In fact, this equals the cache miss ratio when we scan A column-wise, where we have the following periodic pattern:
[1 Miss] -> [15 Hits] -> [1 Miss] -> [15 Hits] -> ...
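To make the hit/miss arithmetic concrete, here is a small simulation. It is an illustrative sketch, not a model of any specific CPU: a fully associative 32 KB cache with 64-byte lines and LRU replacement, scanning a 10000 x 3 column-major matrix. Both traversal orders reproduce the 6.25% miss ratio.

```python
from collections import OrderedDict

def miss_ratio(accesses, capacity_lines=512, ints_per_line=16):
    """Fully associative LRU cache model: 512 lines x 64 bytes = 32 KB."""
    cache = OrderedDict()              # cache-line index -> None, in LRU order
    misses = total = 0
    for addr in accesses:              # addr is an element index into the flat array
        line = addr // ints_per_line
        total += 1
        if line in cache:
            cache.move_to_end(line)    # hit: refresh LRU position
        else:
            misses += 1                # miss: load the line, evict LRU line if full
            cache[line] = None
            if len(cache) > capacity_lines:
                cache.popitem(last=False)
    return misses / total

NROWS, NCOLS = 10_000, 3
addr = lambda r, c: c * NROWS + r      # column-major storage, as in R

row_wise = (addr(r, c) for r in range(NROWS) for c in range(NCOLS))
col_wise = (addr(r, c) for c in range(NCOLS) for r in range(NROWS))

print(miss_ratio(row_wise))            # 0.0625
print(miss_ratio(col_wise))            # 0.0625
```

The row-wise scan only ever keeps 3 lines "live" at a time, far below the 512-line capacity, so nothing useful is ever evicted.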
Now try a 5000 x 5000 matrix. In that case, after reading the first row, 16 x 5000 elements have been fetched into cache, but that footprint (about 312 KB) is much larger than the cache capacity, so eviction has already kicked out A[1, 1] to A[16, 1] (most caches apply a "least recently used" (LRU) cache-line replacement policy). When we come back to scan the 2nd row, we have to fetch A[2, 1] from RAM again. So a row-wise scan gives a cache miss ratio of 100%. In contrast, a column-wise scan still has a cache miss ratio of only 1 / 16 = 6.25%. In this example, we will observe that the column-wise scan is much faster.
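The eviction effect can be reproduced with the same kind of LRU cache sketch, scaled down so it runs quickly. The sizes here are hypothetical (a 256 x 256 column-major matrix against a deliberately tiny cache of 16 lines), but the proportions mirror the 5000 x 5000 case: one row touches far more lines than the cache can hold.

```python
from collections import OrderedDict

def miss_ratio(accesses, capacity_lines=16, ints_per_line=16):
    """Fully associative LRU cache, deliberately tiny: 16 lines = 1 KB."""
    cache = OrderedDict()              # cache-line index -> None, in LRU order
    misses = total = 0
    for addr in accesses:
        line = addr // ints_per_line
        total += 1
        if line in cache:
            cache.move_to_end(line)    # hit: refresh LRU position
        else:
            misses += 1                # miss: load the line, evict LRU line if full
            cache[line] = None
            if len(cache) > capacity_lines:
                cache.popitem(last=False)
    return misses / total

N = 256                                # one row touches 256 lines >> 16-line cache
addr = lambda r, c: c * N + r          # column-major storage, as in R

row_wise = (addr(r, c) for r in range(N) for c in range(N))
col_wise = (addr(r, c) for c in range(N) for r in range(N))

print(miss_ratio(row_wise))            # 1.0    (every access misses)
print(miss_ratio(col_wise))            # 0.0625 (one miss per 16 elements)
```

By the time the scan returns for the 2nd row, every line it loaded for the 1st row has been evicted, so the row-wise miss ratio is 100% while the column-wise scan keeps its 1/16 ratio.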
In summary, with a 10000 x 3 matrix, we have the same cache performance whether we scan it by row or by column. I don't see that rowWise is faster than columnWise from the median times reported by microbenchmark. Their execution times may not be exactly equal, but the difference is too minor to be of concern.
For a 5000 x 5000 matrix, rowWise is much slower than columnWise.
Thanks for verification.
Remark
The "golden rule" that we should ensure sequential memory access in the innermost loop is a general guideline for efficiency, but don't interpret it too narrowly.
In fact, if you treat the three columns of A as three vectors x, y, z, and consider the element-wise addition (i.e., the row-wise sum of A): z[i] = x[i] + y[i], are we not getting sequential access to all three vectors? Doesn't this fall under the "golden rule"? Scanning a 10000 x 3 matrix by row is no different from alternately reading three vectors sequentially, and that is very efficient.
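To see the equivalence concretely, here is a small sketch (the values are illustrative only): the row-wise sum of a column-major 10000 x 3 matrix reads the flat storage at three positions that each advance sequentially, exactly as if we were stepping through three vectors in lockstep.

```python
NROWS = 10_000
x = list(range(NROWS))                   # column 1
y = list(range(NROWS, 2 * NROWS))        # column 2
z = list(range(2 * NROWS, 3 * NROWS))    # column 3
A = x + y + z                            # column-major flat storage, as in R

# Row-wise scan of A: row i touches offsets i, NROWS + i, 2*NROWS + i.
# Each of the three regions is read sequentially, one cache line per 16 ints.
row_sums_scan = [A[i] + A[NROWS + i] + A[2 * NROWS + i] for i in range(NROWS)]

# Alternately reading three vectors sequentially is the identical access pattern.
row_sums_vec = [xi + yi + zi for xi, yi, zi in zip(x, y, z)]

assert row_sums_scan == row_sums_vec
```

Each of the three streams incurs one miss per 16 elements, which is exactly the 6.25% ratio computed above for the tall, skinny matrix.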