5.8.Loop Unrolling

\(5.8.\)Loop Unrolling

  Loop unrolling can improve performance in two ways.

  1. It reduces the number of operations that do not contribute directly to the program result, such as loop indexing and conditional branching.

  2. It exposes ways in which we can further transform the code to reduce the number of operations in the critical paths of the overall computation.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
/* 2 x 1 loop unrolling */
void combine5(vec_ptr v, data_t *dest)
{
long i;
long length = vec_length(v);
long limit = length - 1;
data_t *data = get_vec_start(v);
data_t acc = IDENT;

/* Combine 2 elements at a time */
for (i = 0; i < limit; i += 2) {
acc = (acc OP data[i]) OP data[i + 1];
}

/* Finish any remaining elements */
for (; i < length; i++) {
acc = acc OP data[i];
}
*dest = acc;
}

  The performance of different unrolling is as below:

  We see that the CPE improves little when \(k \geq 2\). To understand this, consider the case when \(k=2\):

1
2
3
4
5
6
7
8
# Inner loop of combine5. data_t = double, OP = *
# i in %rdx, data %rax, limit in %rbx, acc in %xmm0
.L35: loop:
vmulsd (%rax,%rdx,8), %xmm0, %xmm0 # Multiply acc by data[i]
vmulsd 8(%rax,%rdx,8), %xmm0, %xmm0 # Multiply acc by data[i+1]
addq $2, %rdx # Increment i by 2
cmpq %rdx, %rbp # Compare to limit: i
jg .L35 # If >, goto loop

  The vmulsd instructions each get translated into two operations: one to load an array element from memory and one to multiply this value by the accumulated value. The data-flow representation of the program is as below:

  There's still a critical path of \(n\)