6.6.The Impact of Caches on Program Performance
\(6.6.\)The Impact of Caches on Program Performance
1.Rearranging Loops to Increase Spatial Locality
Consider the problem of multiplying a pair of \(2 \times 2\) matrices: \(C = AB\):
\[ \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \cdot \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \]
and we make the following assumptions:
Each array is an \(n \times n\) array of double, with
sizeof(double) = 8
.There is a single cache with a 32-byte block size(B = 32).
The array size n is so large that a single matrix row does not fit in the L1 cache.
The compiler stores local variables in registers, and thus references to local variables inside loops do not require any load or store instructions.
We can write three different versions:
1 | for (i = 0; i < n; i++) |
1 | for (j = 0; j < n; j++) |
1 | for (k = 0; k < n; k++) |
The third routine present an interesting trade-off: With two loads and a store, they require one more memory operation than the first routines. On the other hand, since the inner loop scans both B and C row-wise with a stride-1 access pattern, the miss rate on each array is only 0.25 misses per iteration, for a total of 0.50 misses per iteration.
The performance result is as below:
We can see that in this case, miss rate is a better predictor of performance than the total number of memory accesses.
2.Exploiting Locality
Focus your attention on the inner loops.
Try to maximize the spatial locality in your programs by reading data objects sequentially, with stride 1, in the order they are stored in memory.
Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory.