5.3.Optimizing Methods

5.3.Optimizing Methods

1.Code Motion

1
2
3
4
5
6
7
8
9
10
void combine1(vec_ptr v, data_t *dest) {
long i;

*dest = IDENT;
for (i = 0; i < vec_length(v); i++) {
data_t val;
get_vec_element(v, i, &val);
*dest = *dest OP val;
}
}

  Since the length of v is fixed, it's needless to compute its length for each iteration. So we can take this computation alone:

1
2
3
4
5
6
7
8
9
10
11
void combine2(vec_ptr v, data_t *dest) {
long i;
long length = vec_length(v);

*dest = IDENT;
for (i = 0; i < length; i++) {
data_t val;
get_vec_element(v, i, &val);
*dest = *dest OP val;
}
}

2.Reduce Procedure Calls

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
data_t *get_vec_start(vec_ptr v) {
return v->data;
}

/* Direct access to vector data */
void combine3(vec_ptr v, data_t *dest) {
long i;
long length = vec_length(v);
data_t *data = get_vec_start(v);

*dest = IDENT;
for (i = 0; i < length; i++) {
*dest = *dest OP data[i];
}
}

3.Eliminating Unneeded Memory References

  The assembly code for combine3 is as below:

1
2
3
4
5
6
7
8
9
10
# Inner loop of combine3. data_t = double, OP = * 
# dest in %rbx, data+i in %rdx, data+length in %rax

.L17: loop:
vmovsd (%rbx), %xmm0 # Read product from dest
vmulsd (%rdx), %xmm0, %xmm0 # Multiply product by data[i]
vmovsd %xmm0, (%rbx) # Store product at dest
addq $8, %rdx # Increment data+i
cmpq %rax, %rdx # Compare to data+length
jne .L17 # If !=, goto loop

  We can see that the accumulated values read from and written to memory on each iteration. We can eliminate this needless reading and writing of memory by rewriting the code in the style of combine4:

1
2
3
4
5
6
7
8
9
10
11
void combine4(vec_ptr v, data_t *dest) {
long i;
long length = vec_length(v);
data_t *data = get_vec_start(v);
data_t acc = IDENT;

for (i = 0; i < length; i++) {
acc = acc OP data[i];
}
*dest = acc;
}

  With the following assembly code:

1
2
3
4
5
6
7
# Inner loop of combine4. data_t = double, OP = *
# acc in %xmm0, data+i in %rdx, data+length in %rax
.L25: loop:
vmulsd (%rdx), %xmm0, %xmm0 # Multiply acc by data[i]
addq $8, %rdx # Increment data+i
cmpq %rax, %rdx # Compare to data+length
jne .L25 # If !=, goto loop

  The key is to dereference the pointer as fewer as possible.

  However, this program will lead to different result due to memory alias:

1
2
combine3(v, get_vec_start(v) + 2);
combine4(v, get_vec_start(v) + 2)


  As a result, we modify the combine3 as below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
/* Make sure dest updated on each iteration */
void combine3w(vec_ptr v, data_t *dest) {
long i;
long length = vec_length(v);
data_t *data = get_vec_start(v);
data_t acc = IDENT;

/* Initialize in event length <= 0 */
*dest = acc;

for (i = 0; i < length; i++) {
acc = acc OP data[i];
*dest = acc;
}
}

  with the following assembly code:

1
2
3
4
5
6
7
8
9
# Inner loop of combine3. data_t = double, OP = *. Compiled -O1
# dest in %rbx, data+i in %rdx, data+length in %rax
.L17: loop:
vmovsd (%rbx), %xmm0 # Read product from dest
vmulsd (%rdx), %xmm0, %xmm0 # Multiply product by data[i]
vmovsd %xmm0, (%rbx) # Store product at dest
addq $8, %rdx # Increment data+i
cmpq %rax, %rdx # Compare to data+length
jne .L17 # If !=, goto loop

  This program dereferences the dest less while avoiding the memory alias.