5.10.Some Limiting Factors

\(5.10.\)Some Limiting Factors

1.Register Spilling

  If a program has a degree of parallelism \(P\) that exceeds the number of available registers, then the compiler will resort to spilling, storing some of the temporary values in memory, typically by allocating space on the run-time stack:

  Once the number of loop variables exceeds the number of available registers, the program must allocate some on the stack.

  Take the following \(10 \times 10\) unrolling loop for example:

1
2
# Updating of accumulator acc0 in 10 x 10 urolling
vmulsd (%rdx), %xmm0, %xmm0 # acc0 *= data[i]

  The comparable part of the code for \(20 \times 20\) unrolling has a much different form:

1
2
3
4
# Updating of accumulator acc0 in 20 x 20 unrolling
vmovsd 40(%rsp), %xmm0
vmulsd (%rdx), %xmm0, %xmm0
vmovsd %xmm0, 40(%rsp)

  The accumulator is kept as a local variable on the stack, at offset 40 from the stack pointer. The program must read both its value and the value of data[i] from memory, multiply them, and store the result back to memory.

2.Branch Prediction

  gcc don't use conditional transfers of control to generate code for branch prediction, but rather compute the values along both branches of a conditional expression or statement and then use conditional moves to select the desired value.


  To illustrate this, we use the program with branch prediction and that without predition as examples:

1
2
3
4
5
6
7
8
9
10
11
void combine4(vec_ptr v, data_t *dest) {
long i;
long length = vec_length(v);
data_t *data = get_vec_start(v);
data_t acc = IDENT;

for (i = 0; i < length; i++) {
acc = acc OP data[i];
}
*dest = acc;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
/* Include bounds check in loop */
void combine4b(vec_ptr v, data_t *dest) {
long i;
long length = vec_length(v);
data_t acc = IDENT;

for (i = 0; i < length; i++) {
if (i >= 0 && i < v->len) {
acc = acc OP v->data[i];
}
}
*dest = acc;
}

  The additional computation required to perform bounds checking can take place in parallel with the combining operations. The processor is able to predict the outcomes of these branches, and so none of this evaluation has much effect on the fetching and processing of the instructions that form the critical path in the program execution.

3.Write Code Suitable for Implementation with Conditional Moves

  • gcc is able to generate conditional moves for code written in a more "functional" style, where we use conditional operations to compute values and then update the program state with these values, as opposed to a more "imperative" style, where we use conditionals to selectively update program state.

  Take the program that set a[i] to be the maximum of a[i], b[i] and b[i] the minimum.

1
2
3
4
5
6
7
8
9
10
11
12
/* imperative style */
/* Rearrange two vectors so that for each i, b[i] >= a[i] */
void minmax1(long a[], long b[], long n) {
long i;
for (i = 0; i < n; i++) {
if (a[i] > b[i]) {
long t = a[i];
a[i] = b[i];
b[i] = t;
}
}
}
1
2
3
4
5
6
7
8
9
10
11
/* functional style */
/* Rearrange two vectors so that for each i, b[i] >= a[i] */
void minmax2(long a[], long b[], long n) {
long i;
for (i = 0; i < n; i++) {
long min = a[i] < b[i] ? a[i] : b[i];
long max = a[i] < b[i] ? b[i] : a[i];
a[i] = min;
b[i] = max;
}
}

  The method is like data transfer, we use less conditional expression to reduce misprediction rate.