6.6.The Impact of Caches on Program Performance

\(6.6.\)The Impact of Caches on Program Performance

1.Rearranging Loops to Increase Spatial Locality

  Consider the problem of multiplying a pair of \(2 \times 2\) matrices: \(C = AB\):

\[ \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \cdot \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \]

  and we make the following assumptions:

  • Each array is an \(n \times n\) array of double, with sizeof(double) = 8.

  • There is a single cache with a 32-byte block size(B = 32).

  • The array size n is so large that a single matrix row does not fit in the L1 cache.

  • The compiler stores local variables in registers, and thus references to local variables inside loops do not require any load or store instructions.

  We can write three different versions:

1
2
3
4
5
6
7
for (i = 0; i < n; i++)
for (j = 0; j < n; j++) {
sum = 0.0;
for (k = 0; k < n; k++)
sum += A[i][k]*B[k][j];
C[i][j] += sum;
}
1
2
3
4
5
6
for (j = 0; j < n; j++)
for (k = 0; k < n; k++) {
r = B[k][j];
for (i = 0; i < n; i++)
C[i][j] += A[i][k]*r;
}
1
2
3
4
5
6
for (k = 0; k < n; k++)
for (i = 0; i < n; i++) {
r = A[i][k];
for (j = 0; j < n; j++)
C[i][j] += r * B[k][j];
}

  The third routine present an interesting trade-off: With two loads and a store, they require one more memory operation than the first routines. On the other hand, since the inner loop scans both B and C row-wise with a stride-1 access pattern, the miss rate on each array is only 0.25 misses per iteration, for a total of 0.50 misses per iteration.

  The performance result is as below:

  We can see that in this case, miss rate is a better predictor of performance than the total number of memory accesses.

2.Exploiting Locality

  • Focus your attention on the inner loops.

  • Try to maximize the spatial locality in your programs by reading data objects sequentially, with stride 1, in the order they are stored in memory.

  • Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory.