Why the second program performs worse, even though it should have considerably less cache misses?-CodePudding

Consider the following programs:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long u64;

int program_1(u64* a, u64* b)
{
  const u64 lim = 50l * 1000l * 1000l;
  // Reads arrays
  u64 sum = 0;
  for (u64 i = 0; i < lim * 100;   i) {
    sum  = a[i % lim];
    sum  = b[i % lim];
  }

  printf("%llu\n", sum);
  return 0;
}


int program_2(u64* a, u64* b)
{
  const u64 lim = 50l * 1000l * 1000l;
  // Reads arrays
  u64 sum = 0;
  for (u64 i = 0; i < lim * 100;   i) {
    sum  = a[i % lim];
  }
  for (u64 i = 0; i < lim * 100;   i) {
    sum  = b[i % lim];
  }

  printf("%llu\n", sum);
  return 0;
}

Both programs are identical: they fill up an array with 1s, then read every element 100 times, adding to a counter. The only difference is the first one fuses the adder loop, while the second one performs two separate passes. Given that M1 has a 64KB of L1 data cache, my understanding is that the following would happen:

Program 1

sum  = a[0] // CACHE MISS. Load a[0..8192] on L1.
sum  = b[0] // CACHE MISS. Load b[0..8192] on L1.
sum  = a[1] // CACHE MISS. Load a[0..8192] on L1.
sum  = b[1] // CACHE MISS. Load b[0..8192] on L1.
sum  = a[2] // CACHE MISS. Load a[0..8192] on L1.
sum  = b[2] // CACHE MISS. Load b[0..8192] on L1.
(...)

Program 2

sum  = a[0] // CACHE MISS. Load a[0..8192] on L1.
sum  = a[1] // CACHE HIT!
sum  = a[2] // CACHE HIT!
sum  = a[3] // CACHE HIT!
sum  = a[4] // CACHE HIT!
...
sum  = a[8192] // CACHE MISS. Load a[8192..16384] on L1.
sum  = a[8193] // CACHE HIT!
sum  = a[8194] // CACHE HIT!
sum  = a[8195] // CACHE HIT!
sum  = a[8196] // CACHE HIT!
...
...
sum  = b[0] // CACHE MISS. Load b[0..8192] on L1.
sum  = b[1] // CACHE HIT!
sum  = b[2] // CACHE HIT!
sum  = b[3] // CACHE HIT!
sum  = b[4] // CACHE HIT!
...

This would lead me to believe that the first program is slower, since every read is a cache miss, while the second one consists majorly of cache hits. The results, though, differ. Running on a Macbook Pro M1, with clang -O2, the first program takes 2.8s to complete, while the second one takes about 3.8s.

What is wrong about my mental model of how the L1 cache works?

CodePudding user response：

I'd expect that:

a) while the CPU is waiting for data to be fetched into L1 for the sum = a[i % lim]; it can ask for data to be fetched for the sum = b[i % lim]; into L1. Essentially; Program 1 is waiting for 2 cache misses in parallel while Program 2 is waiting for 1 cache miss at a time and could be up to twice as slow.

b) The loop overhead (all the work in for (u64 i = 0; i < lim * 100; i) {), and the indexing (calculating i%lim) is duplicated in Program 2; causing Program 2 to do almost twice as much extra work (that has nothing to do with cache misses).

c) The compiler is bad at optimising. I'm surprised the same code wasn't generated for both versions. I'm shocked that neither CLANG nor GCC managed to auto-vectorize (use SIMD). A very hypothetical idealized perfect compiler should be able to optimize both versions all the way down to write(STDOUT_FILENO, "10000000000\n", 12); return 0.

What is wrong about my mental model of how the L1 cache works?

It looks like you thought the cache can only cache one thing at a time. For Program 1 it would be more like:

sum  = a[0] // CACHE MISS
sum  = b[0] // CACHE MISS
sum  = a[1] // CACHE HIT (data still in cache)
sum  = b[1] // CACHE HIT (data still in cache)
sum  = a[2] // CACHE HIT (data still in cache)
sum  = b[2] // CACHE HIT (data still in cache)
sum  = a[3] // CACHE HIT (data still in cache)
sum  = b[3] // CACHE HIT (data still in cache)
sum  = a[4] // CACHE HIT (data still in cache)
sum  = b[4] // CACHE HIT (data still in cache)
sum  = a[5] // CACHE HIT (data still in cache)
sum  = b[5] // CACHE HIT (data still in cache)
sum  = a[6] // CACHE HIT (data still in cache)
sum  = b[6] // CACHE HIT (data still in cache)
sum  = a[7] // CACHE HIT (data still in cache)
sum  = b[7] // CACHE HIT (data still in cache)

sum  = a[8] // CACHE MISS
sum  = b[8] // CACHE MISS

For program 2 it's probably (see note) the same number of cache misses in a different order:

sum  = a[0] // CACHE MISS
sum  = a[1] // CACHE HIT (data still in cache)
sum  = a[2] // CACHE HIT (data still in cache)
sum  = a[3] // CACHE HIT (data still in cache)
sum  = a[4] // CACHE HIT (data still in cache)
sum  = a[5] // CACHE HIT (data still in cache)
sum  = a[6] // CACHE HIT (data still in cache)
sum  = a[7] // CACHE HIT (data still in cache)

sum  = a[8] // CACHE MISS

..then:

sum  = b[0] // CACHE MISS
sum  = b[1] // CACHE HIT (data still in cache)
sum  = b[2] // CACHE HIT (data still in cache)
sum  = b[3] // CACHE HIT (data still in cache)
sum  = b[4] // CACHE HIT (data still in cache)
sum  = b[5] // CACHE HIT (data still in cache)
sum  = b[6] // CACHE HIT (data still in cache)
sum  = b[7] // CACHE HIT (data still in cache)

sum  = b[8] // CACHE MISS

NOTE: I assumed any array is larger than cache. If cache was large enough to hold an entire array but too small to hold both arrays; then Program 2 would probably be faster than Program 1. This is the only case where Program 2 would be faster.