I am currently trying to write a program with an L1 miss rate that is as high as possible.
To measure the L1 miss rate I am using the MEM_LOAD_RETIRED.L1_MISS and MEM_LOAD_RETIRED.L1_HIT performance counter events on an Intel Core i7 processor (I am not interested in fill buffer hits). I modified the Linux kernel to give me exact measurements at each context switch so that I can determine precisely how many hits and misses each program gets.
The hardware prefetcher is disabled.
This is the code that I currently have:
#include <sys/mman.h>   /* for mmap */

#define LINE_SIZE 64
#define CACHE_SIZE 4096 * 8
#define MEM_SIZE CACHE_SIZE * 64

void main(int argc, char* argv[])
{
    volatile register char* addr asm ("r12") = mmap(0, MEM_SIZE, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    volatile register unsigned long idx asm ("r13") = 0;
    volatile register unsigned long store_val asm ("r14") = 0;
    volatile register unsigned long x64 asm ("r15") = 88172645463325252ull;
    while(1)
    {
        x64 ^= x64 << 13;
        x64 ^= x64 >> 7;
        x64 ^= x64 << 17;
        store_val = addr[x64 % MEM_SIZE];
    }
}
This code produces exactly one memory access per loop iteration, so my question is: why is the miss rate I am getting here close to 0%? Even without the xorshift, just accessing the array linearly (edit: increasing the index by 64 on each access), I should be getting a miss rate close to 100%, right? What am I missing here?
Thanks in advance! :)
Update: With callgrind I am getting the expected 99.9% miss rate when doing linear accesses, so I don't understand the discrepancy.
Using the perf-tool with:
perf stat -r 10 -B -e mem_load_retired.l1_miss,mem_load_retired.l1_hit ./thrasher
gives me similar results to the one I'm getting using my modified kernel's output.
CodePudding user response:
x64 % MEM_SIZE
is x64 % CACHE_SIZE * 64
which is x64 % 4096 * 8 * 64
which is ((x64 % 4096) * 8) * 64
which touches at most 4096 different locations.
Put your macro replacement text in parentheses.
(I am not positive this is the whole explanation, since 4096 locations at 512-byte intervals ought to map to the same cache sets repeatedly, but maybe this combined with the shifting and XORing produces a pattern that hits multiple times before moving on. A quick test shows the first 4096 iterations touch only 2558 locations.)
CodePudding user response:
UPDATE:
I finally found the issue! Really a rather silly mistake: Linux maps untouched anonymous pages copy-on-write to a single shared zero page, so because I never initialized the array, all of my reads were presumably backed by the same physical page, which obviously leads to a high hit rate.
I updated my code like so:
#define PAGE 4096
#define LINE_SIZE 64
#define CACHE_SIZE 4096 * 8
#define MEM_SIZE (CACHE_SIZE * 512)

/* Initializing the whole array forces every page to be backed by
   its own physical page (no shared zero page). */
char array[MEM_SIZE] = {[0 ... MEM_SIZE-1] = 5};

void main(int argc, char* argv[])
{
    volatile register char* addr asm ("r12") = array;
    volatile register unsigned long idx asm ("r13") = 0;
    volatile register unsigned long store_val asm ("r14") = 0;
    volatile register unsigned long cnt asm ("r15") = 0;
    while(cnt < 100000000)
    {
        idx += PAGE;        /* step a full page per access */
        idx %= MEM_SIZE;
        store_val = addr[idx];
        cnt++;
    }
}
and am now getting my desired ~100% miss rate.
Thanks everyone for weighing in and trying to help, much appreciated!