In this program:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t *data = (uint16_t[]){1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int mlen = 10;
    uint16_t partial = 0;

    /* Request the cache line that holds data[8]. */
    __builtin_prefetch(data + 8);
    while (mlen > 0) {
        partial = *(uint16_t *)data;
        data += 1;   /* advance to the next element */
        mlen -= 1;
    }
    return 0;
}
I am using __builtin_prefetch(data + 8);
so I expect the elements up to index 8 to be fetched into the cache. But if I compile the program with
gcc prefetcher.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3
it is slower than this:
gcc prefetcher.c -o no-prefetch -std=c11 -O3
This is the output of the two builds, respectively:
12401 L1-dcache-load-misses # 6.76% of all L1-dcache accesses
183459 L1-dcache-loads
0.000881880 seconds time elapsed
0.000952000 seconds user
0.000000000 seconds sys
and this is without the prefetch:
12991 L1-dcache-load-misses # 6.87% of all L1-dcache accesses
189161 L1-dcache-loads
0.001349719 seconds time elapsed
0.001423000 seconds user
0.000000000 seconds sys
What do I need to do so that my __builtin_prefetch code runs faster?
The output above is from the perf program.
CodePudding user response:
"What do I need to do so that my __builtin_prefetch code runs faster?"
You need to remove __builtin_prefetch. It is literally the only instruction that differs between the two builds. The compiler optimized the rest of your code to a no-op, because the loop has no side effects and its result is never used.
Without the prefetch, your program is compiled to:
main:
xor eax, eax
ret
With the prefetch, it is compiled to:
main:
xor eax, eax
prefetcht0 [rsp-24]
ret
Even if you do return partial, for example, the compiler is able to calculate the entire result at compile time and reduce the entire program to just return <constant>.
You can easily inspect the generated assembly of your programs at https://godbolt.org/.
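To measure a prefetch at all, the loop has to survive optimization: the data must not be a compile-time constant and the result must actually be used. Below is a minimal sketch of such a loop, not a tuned implementation. The function name sum_with_prefetch and the PREFETCH_DISTANCE value are hypothetical choices for illustration, and the DO_PREFETCH macro is assumed to be the one passed with -DDO_PREFETCH in the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed look-ahead in elements; a guess that has to be tuned by measurement. */
#define PREFETCH_DISTANCE 64

/* Hypothetical helper: sums n 16-bit values and, when built with
 * -DDO_PREFETCH, requests data a fixed distance ahead of the current index. */
uint32_t sum_with_prefetch(const uint16_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
#ifdef DO_PREFETCH
        /* Stay inside the array so the address expression remains valid.
         * Arguments: address, 0 = prefetch for read, 3 = keep in all cache levels. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
#endif
        sum += data[i];
    }
    return sum;
}
```

For a benchmark, the caller would fill a large array from something only known at run time (a file, or a size read from argv) and print the returned sum, so the compiler cannot fold the loop into a constant. Even then, a plain sequential scan like this one is usually handled well by the hardware prefetcher, so a measurable gain from the software prefetch should not be assumed. Note also that the original ten-element array is only 20 bytes, which typically fits in a single cache line.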