The dimensions (shape) of the matrix have an impact on cache performance.
When we do a row-wise scan of a 10000 x 3 integer matrix A, we can still cache effectively. For simplicity of illustration, I assume that each column of A is aligned to a cache-line boundary.
--------------------------------------
A[1, 1] A[1, 2] A[1, 3] M M M
A[2, 1] A[2, 2] A[2, 3] H H H
. . . . . .
. . . . . .
A[16,1] A[16,2] A[16,3] H H H
--------------------------------------
A[17,1] A[17,2] A[17,3] M M M
A[18,1] A[18,2] A[18,3] H H H
. . . . . .
. . . . . .
A[32,1] A[32,2] A[32,3] H H H
--------------------------------------
A[33,1] A[33,2] A[33,3] M M M
A[34,1] A[34,2] A[34,3] H H H
. . . . . .
. . . . . .
A 64-byte cache line can hold 16 (4-byte) integers. When we access A[1, 1], a full cache line is filled, that is, A[1, 1] to A[16, 1] are all loaded into cache. When we scan a row A[1, 1], A[1, 2], A[1, 3], a 16 x 3 block is loaded into cache, which is far smaller than the cache capacity (32 KB). While we have a cache miss (M) for each element in the 1st row, when we start to scan the 2nd row, we have a cache hit (H) for every element. So we have a periodic pattern:
[3 Misses] -> [45 Hits] -> [3 Misses] -> [45 Hits] -> ...
That is, we have on average a cache miss ratio of 3 / 48 = 1 / 16 = 6.25%. In fact, this equals the cache miss ratio when we scan A column-wise, where we have the following periodic pattern:
[1 Miss] -> [15 Hits] -> [1 Miss] -> [15 Hits] -> ...
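To make the hit/miss arithmetic concrete, here is a small simulation. It is an illustrative sketch, not a model of any specific CPU: a fully associative 32 KB cache with 64-byte lines and LRU replacement, scanning a 10000 x 3 column-major matrix. Both traversal orders reproduce the 6.25% miss ratio.

```python
from collections import OrderedDict

def miss_ratio(accesses, capacity_lines=512, ints_per_line=16):
    """Fully associative LRU cache model: 512 lines x 64 bytes = 32 KB."""
    cache = OrderedDict()              # cache-line index -> None, in LRU order
    misses = total = 0
    for addr in accesses:              # addr is an element index into the flat array
        line = addr // ints_per_line
        total += 1
        if line in cache:
            cache.move_to_end(line)    # hit: refresh LRU position
        else:
            misses += 1                # miss: load the line, evict LRU line if full
            cache[line] = None
            if len(cache) > capacity_lines:
                cache.popitem(last=False)
    return misses / total

NROWS, NCOLS = 10_000, 3
addr = lambda r, c: c * NROWS + r      # column-major storage, as in R

row_wise = (addr(r, c) for r in range(NROWS) for c in range(NCOLS))
col_wise = (addr(r, c) for c in range(NCOLS) for r in range(NROWS))

print(miss_ratio(row_wise))            # 0.0625
print(miss_ratio(col_wise))            # 0.0625
```

The row-wise scan only ever keeps 3 lines "live" at a time, far below the 512-line capacity, so nothing useful is ever evicted.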
Now try a 5000 x 5000 matrix. In that case, after reading the first row, 16 x 5000 elements have been fetched into cache, but that footprint (about 312 KB) is much larger than the cache capacity, so eviction has already kicked out A[1, 1] to A[16, 1] (most caches apply a "least recently used" (LRU) cache-line replacement policy). When we come back to scan the 2nd row, we have to fetch A[2, 1] from RAM again. So a row-wise scan gives a cache miss ratio of 100%. In contrast, a column-wise scan still has a cache miss ratio of only 1 / 16 = 6.25%. In this example, we will observe that the column-wise scan is much faster.
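The eviction effect can be reproduced with the same kind of LRU cache sketch, scaled down so it runs quickly. The sizes here are hypothetical (a 256 x 256 column-major matrix against a deliberately tiny cache of 16 lines), but the proportions mirror the 5000 x 5000 case: one row touches far more lines than the cache can hold.

```python
from collections import OrderedDict

def miss_ratio(accesses, capacity_lines=16, ints_per_line=16):
    """Fully associative LRU cache, deliberately tiny: 16 lines = 1 KB."""
    cache = OrderedDict()              # cache-line index -> None, in LRU order
    misses = total = 0
    for addr in accesses:
        line = addr // ints_per_line
        total += 1
        if line in cache:
            cache.move_to_end(line)    # hit: refresh LRU position
        else:
            misses += 1                # miss: load the line, evict LRU line if full
            cache[line] = None
            if len(cache) > capacity_lines:
                cache.popitem(last=False)
    return misses / total

N = 256                                # one row touches 256 lines >> 16-line cache
addr = lambda r, c: c * N + r          # column-major storage, as in R

row_wise = (addr(r, c) for r in range(N) for c in range(N))
col_wise = (addr(r, c) for c in range(N) for r in range(N))

print(miss_ratio(row_wise))            # 1.0    (every access misses)
print(miss_ratio(col_wise))            # 0.0625 (one miss per 16 elements)
```

By the time the scan returns for the 2nd row, every line it loaded for the 1st row has been evicted, so the row-wise miss ratio is 100% while the column-wise scan keeps its 1/16 ratio.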
In summary, with a 10000 x 3 matrix, we have the same cache performance whether we scan it by row or by column. I don't see that rowWise is faster than columnWise from the median times reported by microbenchmark. Their execution times may not be exactly equal, but the difference is too minor to be of concern.
For a 5000 x 5000 matrix, rowWise is much slower than columnWise.
Thanks for verification.
Remark
The "golden rule" that we should ensure sequential memory access in the innermost loop is a general guideline for efficiency, but don't interpret it too narrowly.
In fact, if you treat the three columns of A as three vectors x, y, z, and consider the element-wise addition (i.e., the row-wise sum of A): z[i] = x[i] + y[i], are we not getting sequential access to all three vectors? Doesn't this fall under the "golden rule"? Scanning a 10000 x 3 matrix by row is no different from alternately reading three vectors sequentially, and that is very efficient.
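To see the equivalence concretely, here is a small sketch (the values are illustrative only): the row-wise sum of a column-major 10000 x 3 matrix reads the flat storage at three positions that each advance sequentially, exactly as if we were stepping through three vectors in lockstep.

```python
NROWS = 10_000
x = list(range(NROWS))                   # column 1
y = list(range(NROWS, 2 * NROWS))        # column 2
z = list(range(2 * NROWS, 3 * NROWS))    # column 3
A = x + y + z                            # column-major flat storage, as in R

# Row-wise scan of A: row i touches offsets i, NROWS + i, 2*NROWS + i.
# Each of the three regions is read sequentially, one cache line per 16 ints.
row_sums_scan = [A[i] + A[NROWS + i] + A[2 * NROWS + i] for i in range(NROWS)]

# Alternately reading three vectors sequentially is the identical access pattern.
row_sums_vec = [xi + yi + zi for xi, yi, zi in zip(x, y, z)]

assert row_sums_scan == row_sums_vec
```

Each of the three streams incurs one miss per 16 elements, which is exactly the 6.25% ratio computed above for the tall, skinny matrix.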