I have read chapter 5 of CSAPP 3e. I want to test if the optimization techniques described in the book can work on my computer. I write the following program:
#define SIZE (1024)
int main(int argc, char* argv[]) {
int sum = 0;
int* array = malloc(sizeof(int) * SIZE);
unsigned long long before = __rdtsc();
for (int i = 0; i < SIZE; i) {
sum = array[i];
}
unsigned long long after = __rdtsc();
double cpe = (double)(after - before) / SIZE;
printf("CPE is %f\n", cpe);
printf("sum is %d\n", sum);
return 0;
}
and it reports the CPE is around 1.00.
I transform the program using the 4x4 loop unrolling technique and it leads to the following program:
#define SIZE (1024)
int main(int argc, char* argv[]) {
int sum = 0;
int* array = malloc(sizeof(int) * SIZE);
int sum0 = 0;
int sum1 = 0;
int sum2 = 0;
int sum3 = 0;
/* 4x4 unrolling */
unsigned long long before = __rdtsc();
for (int i = 0; i < SIZE; i = 4) {
sum0 = array[i];
sum1 = array[i 1];
sum2 = array[i 2];
sum3 = array[i 3];
}
unsigned long long after = __rdtsc();
sum = sum0 sum1 sum2 sum3;
double cpe = (double)(after - before) / SIZE;
printf("CPE is %f\n", cpe);
printf("sum is %d\n", sum);
return 0;
}
Note that I omit the code to handle the situation when SIZE
is not a multiple of 4. This program reports the CPE is around 0.80.
My program runs on an AMD 5950X, and according to AMD's software optimization manual (https://developer.amd.com/resources/developer-guides-manuals/), the integer addition instruction has a latency of 1 cycle and throughput of 4 instructions per cycle. It also has a load-store unit which could execute three independent load operations at the same time. My expectation of the CPE is 0.33, and I do not know why the result is so much higher.
My compiler is gcc 12.2.0. All programs are compiled with flags -Og
.
I checked the assembly code of the optimized program, but found nothing helpful:
.L4:
movslq %r9d, %rcx
addl (%r8,%rcx,4), %r11d
addl 4(%r8,%rcx,4), %r10d
addl 8(%r8,%rcx,4),