Problem description
I am trying to evaluate how performance changes with the size of the caches using the gem5 simulator (x86 ISA in SE mode). I used the example configuration script (gem5/configs/example/se.py, which uses gem5/configs/common/CacheConfig.py to configure the cache memories).
I noticed that:
- Using caches (the "--caches" option), performance increases. That sounds good!
- Using caches and also enabling an L2 cache level (the "--caches" and "--l2cache" options), performance does not improve further.
I tried two configurations for the caches and obtained the following results:
- L1D=32kB, L2=256kB
simSeconds                               0.063785
system.cpu.dcache.overallMisses::total   196608
system.cpu.dcache.overallHits::total     13434883
system.l2.overallMisses::total           196610
system.l2.overallHits::total             45
- L1D=32kB, L2=512kB
simSeconds                               0.063785
system.cpu.dcache.overallMisses::total   196608
system.cpu.dcache.overallHits::total     13434883
system.l2.overallMisses::total           196610
system.l2.overallHits::total             45
The statistics are exactly the same for both configurations, as if the L2 size were not taken into account. Moreover, of all the misses in L1, only 45 turn into hits in L2. I don't think this is normal behaviour.
Any suggestion on how to fix this problem? Maybe there is some port connection that I need to set up in the configuration file.
Benchmark
I tested cache configurations with a simple "vector addition" example:
#include <iostream>
#define LKMC_M5OPS_DUMPSTATS __asm__ __volatile__ (".word 0x040F; .word 0x0041;" : : "D" (0), "S" (0) :)
#define LKMC_M5OPS_RESETSTATS __asm__ __volatile__ (".word 0x040F; .word 0x0040;" : : "D" (0), "S" (0) :)
int main(int argc, char* argv[]) {
    int N = 1024 * 1024 * 1;
    float* A = new float[N];
    float* B = new float[N];
    float* C = new float[N];
    LKMC_M5OPS_RESETSTATS;
    for (auto i = 0; i < N; i++)
    {
        C[i] = A[i] + B[i];
    }
    LKMC_M5OPS_DUMPSTATS;
    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}
gem5 command
For completeness, here is the command I used for the simulation:
build/X86/gem5.opt ./configs/example/se.py --cmd=./tests/test-progs/add_vector/add_vector --cpu-type=TimingSimpleCPU --caches --l2cache --l1d_size=32kB --l1i_size=32kB --l2_size=[256,512]kB
Answer
Those results look normal for this microbenchmark.
Your arrays are much bigger than L2, and you only make one pass over them, so there is zero reuse. Each load or store either hits in L1d (in a line that has already been accessed) or misses all the way to DRAM the first time a cache line is touched.
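The miss counts are consistent with that. Assuming gem5's default 64-byte line size, the footprint is 3 arrays * 1 Mi floats * 4 B = 12 MiB, which is 12 MiB / 64 B = 196,608 distinct cache lines, matching your L1d miss count exactly: every line (including the write-allocated lines of C) is demand-fetched once and never touched again.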
Unless you're simulating an L2 with a hardware prefetcher, there's no way L2 can have a line ready when you haven't touched it yet. Even then, with SIMD vectorization this loop probably streams through memory faster than a HW prefetcher could keep up with. The few L2 hits you do see probably come from rare conflict misses in L1d.
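(If you do want to try a prefetcher: recent gem5 versions expose options like --l2-hwp-type=StridePrefetcher in se.py, but check configs/common/Options.py in your own tree, since the exact option names vary between versions.)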
If you want to see L2 matter, put a repeat loop around the array-sum inner loop, and use arrays larger than L1d but small enough for all three to fit in L2 (see the sketch below). Or anything else with some locality over a larger time/space scale than L1d can handle.
Your current test only has spatial locality within a single cache line, and never comes back to that line later. So at most 3 to 6 cache lines need to stay hot at any one time to achieve the best case: hits for the accesses to the later floats in the same line.
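Here is a minimal sketch of that repeat-loop version. The sizes are assumptions: 16 Ki floats (64 KiB) per array gives a footprint of about 192 KiB, bigger than your 32 kB L1d but small enough to fit in a 256 kB or 512 kB L2, and the repeat count of 100 is arbitrary:
#include <iostream>
#define LKMC_M5OPS_DUMPSTATS __asm__ __volatile__ (".word 0x040F; .word 0x0041;" : : "D" (0), "S" (0) :)
#define LKMC_M5OPS_RESETSTATS __asm__ __volatile__ (".word 0x040F; .word 0x0040;" : : "D" (0), "S" (0) :)
int main() {
    const int N = 16 * 1024;   // 64 KiB per array, ~192 KiB total: larger than L1d, fits in L2
    const int REPEAT = 100;    // arbitrary; revisit the same data many times
    float* A = new float[N];
    float* B = new float[N];
    float* C = new float[N];
    LKMC_M5OPS_RESETSTATS;
    for (int r = 0; r < REPEAT; r++) {
        for (int i = 0; i < N; i++) {
            C[i] = A[i] + B[i];   // after the first pass, L1d misses should mostly hit in L2
        }
    }
    LKMC_M5OPS_DUMPSTATS;
    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}
With something like this, the L2 hit count should grow with the repeat count, and shrinking --l2_size below the footprint should become visible in the stats.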