I am currently trying to write a program with an L1 miss rate that is as high as possible.
To measure the L1 miss rate I am using the MEM_LOAD_RETIRED.L1_MISS and MEM_LOAD_RETIRED.L1_HIT performance counter events on an Intel Core i7 processor (I am not interested in fill buffer hits). I modified the Linux kernel to give me exact measurements at each context switch so that I can determine precisely how many hits and misses each program gets.
The hardware prefetcher is disabled.
This is the code that I currently have:
#include <sys/mman.h>   /* for mmap */

#define LINE_SIZE 64
#define CACHE_SIZE 4096 * 8
#define MEM_SIZE CACHE_SIZE * 64

void main(int argc, char* argv[])
{
    volatile register char* addr asm ("r12") = mmap(0, MEM_SIZE, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    volatile register unsigned long idx asm ("r13") = 0;
    volatile register unsigned long store_val asm ("r14") = 0;
    volatile register unsigned long x64 asm ("r15") = 88172645463325252ull;
    while(1)
    {
        x64 ^= x64 << 13;
        x64 ^= x64 >> 7;
        x64 ^= x64 << 17;
        store_val = addr[x64 % MEM_SIZE];
    }
}
This code produces exactly one memory access per loop iteration, so my question is: why is the miss rate I am getting here close to 0%? Even without the xorshift, just accessing the array linearly (edit: increasing the index by 64 on each access), I should be getting a miss rate close to 100%, right? What am I missing here?
Thanks in advance! :)
Update: With callgrind I am getting the expected 99.9% miss rate when doing linear accesses, so I don't understand the discrepancy.
Using the perf-tool with:
perf stat -r 10 -B -e mem_load_retired.l1_miss,mem_load_retired.l1_hit ./thrasher
gives me similar results to the one I'm getting using my modified kernel's output.
CodePudding user response:
x64 % MEM_SIZE
is x64 % CACHE_SIZE * 64
which is x64 % 4096 * 8 * 64
which is ((x64 % 4096) * 8) * 64
which touches at most 4096 different locations.
Put your macro replacement text in parentheses.
(I am not positive this is the whole explanation, since 4096 locations at 512-byte intervals ought to map to the same cache sets repeatedly, but maybe this combined with the shifting and XORing produces a pattern that hits multiple times before moving on. A quick test shows the first 4096 iterations touch only 2558 locations.)
CodePudding user response:
UPDATE:
I finally found the issue! Really a rather silly mistake: Linux maps untouched anonymous pages copy-on-write to a single shared zero page, so because I never initialized the array, all of my reads were presumably backed by the same physical page, which obviously leads to a high hit rate.
I updated my code like so:
#define PAGE 4096
#define LINE_SIZE 64
#define CACHE_SIZE 4096 * 8
#define MEM_SIZE (CACHE_SIZE * 512)

/* Initializing the whole array forces every page to be backed by
   its own physical page (no shared zero page). */
char array[MEM_SIZE] = {[0 ... MEM_SIZE-1] = 5};

void main(int argc, char* argv[])
{
    volatile register char* addr asm ("r12") = array;
    volatile register unsigned long idx asm ("r13") = 0;
    volatile register unsigned long store_val asm ("r14") = 0;
    volatile register unsigned long cnt asm ("r15") = 0;
    while(cnt < 100000000)
    {
        idx += PAGE;        /* step a full page per access */
        idx %= MEM_SIZE;
        store_val = addr[idx];
        cnt++;
    }
}
and am now getting my desired ~100% miss rate.
Thanks everyone for weighing in and trying to help, much appreciated!