Why cannot my program reach integer addition instruction throughput bound?-CodePudding

I have read chapter 5 of CSAPP 3e. I want to test if the optimization techniques described in the book can work on my computer. I write the following program:

#define SIZE (1024)
int main(int argc, char* argv[]) {
  int sum = 0;
  int* array = malloc(sizeof(int) * SIZE);
  unsigned long long before = __rdtsc();
  for (int i = 0; i < SIZE;   i) {
    sum  = array[i];
  }
  unsigned long long after = __rdtsc();
  double cpe = (double)(after - before) / SIZE;
  printf("CPE is %f\n", cpe);
  printf("sum is %d\n", sum);
  return 0;
}

and it reports the CPE is around 1.00.

I transform the program using the 4x4 loop unrolling technique and it leads to the following program:

#define SIZE (1024)
int main(int argc, char* argv[]) {
  int sum = 0;
  int* array = malloc(sizeof(int) * SIZE);

  int sum0 = 0;
  int sum1 = 0;
  int sum2 = 0;
  int sum3 = 0;
  /* 4x4 unrolling */
  unsigned long long before = __rdtsc();
  for (int i = 0; i < SIZE; i  = 4) {
    sum0  = array[i];
    sum1  = array[i   1];
    sum2  = array[i   2];
    sum3  = array[i   3];
  }
  unsigned long long after = __rdtsc();
  sum = sum0   sum1   sum2   sum3;
  double cpe = (double)(after - before) / SIZE;
  printf("CPE is %f\n", cpe);
  printf("sum is %d\n", sum);
  return 0;
}

Note that I omit the code to handle the situation when SIZE is not a multiple of 4. This program reports the CPE is around 0.80.

My program runs on an AMD 5950X, and according to AMD's software optimization manual (https://developer.amd.com/resources/developer-guides-manuals/), the integer addition instruction has a latency of 1 cycle and throughput of 4 instructions per cycle. It also has a load-store unit which could execute three independent load operations at the same time. My expectation of the CPE is 0.33, and I do not know why the result is so much higher.

My compiler is gcc 12.2.0. All programs are compiled with flags -Og.

I checked the assembly code of the optimized program, but found nothing helpful:

.L4:
        movslq  %r9d, %rcx
        addl    (%r8,%rcx,4), %r11d
        addl    4(%r8,%rcx,4), %r10d
        addl    8(%r8,%rcx,4),