How does a smaller block size in a cache-efficient (blocking) algorithm result in higher speedup?


I am doing some testing on cache-efficient algorithms for matrix transposition, and with the loop blocking technique I get a higher speedup when I use a smaller block size. Shouldn't a larger block size result in a higher speedup, since we reduce the number of memory accesses by bringing more data into the cache?

Here is the algorithm:

for (int i = 0; i < n; i += blocksize) {
    for (int j = 0; j < n; j += blocksize) {
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l*n] = src[l + k*n];
            }
        }
    }
}

Answer:

Well, it depends, though it is often true.

In your (transposition) code, the accesses to src are contiguous, but the accesses to dst are strided. When the block size is small, the cache lines of dst can be reused while they are still in the cache, so not much is lost. However, when the block size is big, many cache lines need to be fetched for k=0, and because of the stride they may not all fit in the cache. Indeed, caches are typically N-way set-associative, and a large power-of-two stride tends to cause cache thrashing: in the worst case only a few cache lines can be kept (L1 caches are typically 4- or 8-way associative). In that case, for k=1, all the cache lines need to be fetched again due to conflict misses. A good solution is to add some padding so as to strongly reduce conflict misses. An alternative is to work on a copy so as to avoid reading/writing data with a huge stride. Non-temporal instructions could also help a bit. This related post provides some detailed insights about this kind of problem.
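To illustrate the padding idea, here is a minimal sketch (not the original benchmark): the matrices are stored with a padded leading dimension ld = n + pad so that rows/columns of a power-of-two-sized matrix no longer map to the same cache sets. The function name transpose_padded and the pad value of 8 are illustrative choices, and it assumes blocksize divides n as in the code above.

#include <vector>

// Transpose an n x n matrix stored with a padded leading dimension ld = n + pad.
// The padding breaks the power-of-two stride, so strided accesses to dst stop
// hitting the same cache sets, which reduces conflict misses for larger blocks.
void transpose_padded(const std::vector<double>& src, std::vector<double>& dst,
                      int n, int blocksize, int pad /* e.g. 8, illustrative */) {
    const int ld = n + pad;  // padded leading dimension
    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < n; j += blocksize) {
            for (int k = i; k < i + blocksize; ++k) {
                for (int l = j; l < j + blocksize; ++l) {
                    dst[k + l*ld] = src[l + k*ld];  // indexing uses ld, not n
                }
            }
        }
    }
}

Both src and dst would then be allocated with ld * n elements; only the leading dimension changes, the loop structure stays the same as in the question.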
