For context this question is related to the blog post on Cache Processor Effects, specifically Example 1-2.
In the code snippet below, I'm increasing the step size by 2 each time i.e. the number of operations I'm performing is decreasing by a factor of 2 each time. From the blog post, I am expecting to see for step size 1-16, the average time to complete the loop remains roughly the same. The main intuitions discussed by the author were 1) Majority of the time is contributed by memory access (i.e. we fetch then multiply) rather than arithmetic operations, 2) Each time the cpu fetches a cacheline (i.e. 64 bytes or 16 int).
I've tried to replicate the experiment on my local machine with the following code. Note that I've allocated a new int array for every step size so that they do not take advantage of previous cached data. For a similar reason, I've also only "repeated" the inner for loop for each step size only once (instead of repeating the experiment multiple times).
constexpr long long size = 64 * 1024 * 1024; // 64 MB
for (int step = 1; step <= 1<<15 ; step <<= 1) {
auto* arr = new int[size];
auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < size; i = step) {
arr[i] *= 3;
}
auto finish = std::chrono::high_resolution_clock::now();
auto microseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
std::cout << step << " : " << microseconds.count() << "ms\n";
// delete[] arr; (updated - see Paul's comment)
}
Result, however, was very different from what was described in the blog post.
Without optimization: clang -g -std=c 2a -Wpedantic -Wall -Wextra -o a cpu-cache1.cpp
1 : 222ms
2 : 176ms
4 : 152ms
8 : 140ms
16 : 135ms
32 : 128ms
64 : 130ms
128 : 125ms
256 : 123ms
512 : 118ms
1024 : 121ms
2048 : 62ms
4096 : 32ms
8192 : 16ms
16384 : 8ms
32768 : 4ms
With -O3 optimization clang -g -std=c 2a -Wpedantic -Wall -Wextra -o a cpu-cache1.cpp -O3
1 : 155ms
2 : 145ms
4 : 134ms
8 : 131ms
16 : 125ms
32 : 130ms
64 : 130ms
128 : 121ms
256 : 123ms
512 : 127ms
1024 : 123ms
2048 : 62ms
4096 : 31ms
8192 : 15ms
16384 : 8ms
32768 : 4ms
Note that I'm running with Macbook Pro 2019 and my pagesize is 4096. From the observation above, it seems like until 1024 step size the time taken roughly remain linear. Since each int is 4 bytes, this seem related to the size of a page (i.e. 1024*4 = 4096) which makes me think this might be some kind of prefetching/page related optimization even when there is no optimization specified?
Does someone have any ideas or explanation on why these numbers are occurring?
CodePudding user response:
In your code, you called new int[size]
which is essentially a wrapper around malloc. The kernel does not immediately allocate a physical page/memory to it due to Linux's optimistic memory allocation strategy (see man malloc).
What happens when you call arr[i] *= 3
is that a page fault will occur if the page is not in the Translation Lookup Buffer (TLB). Kernel will check that your requested virtual page is valid, but an associated physical page is not allocated yet. The kernel will assign your requested virtual page with a physical page.
For step = 1024, you are requesting every page associated to arr
. For step = 2048, you are requesting every other page associated to arr
.
This act of assigning a physical page is your bottleneck (based on your data it takes ~120ms to assign 64mb of pages one by one). When you increase step from 1024 to 2048, now the kernel doesn't need to allocate physical pages for every virtual page associated to arr
hence the runtime halves.
As linked by @Daniel Langr, you'll need to "touch" each element of arr or zero-initialize the arr with new int[size]{}
. What this does is to force the kernel to assign physical pages to each virtual page of arr
.