The line of code
next = val;
slows performance down by about 10x. I have checked the generated assembly; the computed result is unchanged.
Why does this one line cause a 10x slowdown?
Here is the result:
➜ ~ clang-13 1.c -O3
➜ ~ ./a.out
rand_read_1
sum = 2624b18779c40, time = 0.19s
rand_read_2
sum = 2624b18779c40, time = 1.24s
CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Here is the code:
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>
#include <stdlib.h>
#define CCR_MULTIPLY_64 6364136223846793005
#define CCR_ADD_64 1
static inline uint64_t my_rand64(uint64_t *r)
{
*r = *r * CCR_MULTIPLY_64 + CCR_ADD_64;
return *r;
}
#define NUM 10000000UL
uint64_t rand_read_1(uint64_t *ptr, uint64_t nr_words)
{
uint64_t i, next, val = 0;
uint64_t sum;
next = 0;
sum = 0;
for (i = 0; i < NUM; i++) {
my_rand64(&next);
next %= nr_words;
val = ptr[next];
sum += val ^ next;
// printf("next1:%ld\n", next);
}
return sum;
}
uint64_t rand_read_2(uint64_t *ptr, uint64_t nr_words)
{
uint64_t i, next, val, next2 = 0;
uint64_t sum;
next = 0;
sum = 0;
for (i = 0; i < NUM; i++) {
my_rand64(&next);
next %= nr_words;
val = ptr[next];
sum += val ^ next;
next = val;
}
return sum;
}
#define SIZE (1024*1024*1024)
static uint64_t get_ns(void)
{
struct timespec val;
uint64_t v;
int ret;
ret = clock_gettime(CLOCK_REALTIME, &val);
if (ret != 0) {
perror("clock_gettime");
exit(1);
}
v = (uint64_t) val.tv_sec * 1000000000LL;
v += (uint64_t) val.tv_nsec;
return v;
}
int main(int argc, char *argv[])
{
uint64_t *ptr;
uint64_t sum;
uint64_t t0, t1, td, t2;
ptr = (uint64_t *)malloc(SIZE);
assert(ptr);
memset(ptr, 0, SIZE);
t0 = get_ns();
printf("rand_read_1\n");
sum = rand_read_1(ptr, SIZE/8);
t1 = get_ns();
td = t1 - t0;
printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
printf("rand_read_2\n");
sum = rand_read_2(ptr, SIZE/8);
t2 = get_ns();
td = t2 - t1;
printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
return 0;
}
CodePudding user response:
The method of benchmarking is a bit dodgy, but this is a real effect.
next = val;
changes something fundamental about the structure of the code: it makes each memory read depend on the result of the previous read. Without that line, the reads are independent (there is a shorter loop-carried dependency chain through my_rand64, which the memory read is not a part of).
Essentially with that line it's a latency benchmark, and without that line, it's a throughput benchmark. Latency and throughput differing by a factor of 10 is reasonable for memory reads.
At the assembly level, without that line, the asm looks like this when compiled with Clang:
.LBB2_3: # =>This Inner Loop Header: Depth=1
imul rcx, r15
add rcx, 1
mov edx, ecx
and edx, 134217727
xor rdx, qword ptr [r14 + 8*rdx]
mov esi, r15d
imul esi, ecx
add rdx, rbx
add esi, 1
and esi, 134217727
mov rbx, qword ptr [r14 + 8*rsi]
xor rbx, rsi
add rbx, rdx
mov rcx, rsi
add rax, -2
jne .LBB2_3
uiCA estimates 9.16 cycles per iteration (the loop was unrolled by a factor of 2, so this corresponds to about 4.5 cycles per iteration of the original loop), but it does not take cache misses into account.
With that line, the assembly looks nearly the same, but that doesn't mean it runs in nearly the same way:
.LBB2_6: # =>This Inner Loop Header: Depth=1
imul ecx, r15d
add ecx, 1
and ecx, 134217727
mov rdx, qword ptr [r14 + 8*rcx]
mov rsi, rcx
xor rsi, rdx
add rsi, rbx
add edx, ecx
imul edx, r15d
add edx, 1
and edx, 134217727
mov rcx, qword ptr [r14 + 8*rdx]
mov rbx, rdx
xor rbx, rcx
add rbx, rsi
add rdx, rcx
mov rcx, rdx
add rax, -2
jne .LBB2_6
Now uiCA estimates 24.11 cycles per iteration (this loop was also unrolled by 2x), again without taking cache misses into account.
CodePudding user response:
Some notes regarding how not to do benchmarking (benchmarking is very hard):

malloc is very expensive and you allocate a lot of memory here. However, the OS may not actually allocate it before it notices that the memory is being used. So unless you force the OS to actually allocate the memory before the benchmarking starts, you'll be benchmarking how slow malloc is and nothing else, which is a gigantic performance killer next to your little algorithm. Additionally, you may allocate too much memory for your OS to handle, and you also do not necessarily allocate an aligned multiple of sizeof(uint64_t), which is a bug. You could do something like this instead (might have to reduce SIZE first):

ptr = malloc(SIZE * sizeof(uint64_t));
assert(ptr);
for (size_t i = 0; i < SIZE; i++) {
    volatile uint64_t tmp = 123; // some garbage value
    ptr[i] = tmp;                // the compiler has to load 123 from stack to heap
}
// actual heap allocation should now be done

The memset to zero probably does not count as "touching" the heap memory, and the compiler might even optimize it out by swapping to calloc. When I check the optimized code (gcc x86_64 Linux), this is indeed what happens. So it does not achieve the "touching the heap" thing described above.

printf and the stdout buffers are performance-heavy and should never be placed inside the benchmarked code. You might just end up benchmarking how slow printf is. For example, changing your code to

printf("rand_read_1\n");
t0 = get_ns();
sum = rand_read_1(ptr, SIZE/8);
t1 = get_ns();

gave vastly different results.

The last
td = t2 - t1;
is nonsense, since it measures everything unrelated to your algorithm that has happened since the previous measurement, including any number of context switches by the OS.
With all these bug fixes applied, main() might look like this instead:
int main(int argc, char *argv[])
{
uint64_t *ptr;
uint64_t sum;
uint64_t t0, t1, td, t2;
ptr = malloc(SIZE * sizeof(uint64_t));
assert(ptr);
for(size_t i=0; i<SIZE; i++)
{
volatile uint64_t tmp = 123; // some garbage value
ptr[i] = tmp; // the compiler has to load 123 from stack to heap
}
printf("rand_read_1\n");
t0 = get_ns();
sum = rand_read_1(ptr, SIZE/8);
t1 = get_ns();
td = t1 - t0;
printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
printf("rand_read_2\n");
t0 = get_ns();
sum = rand_read_2(ptr, SIZE/8);
t1 = get_ns();
td = t1 - t0;
printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
return 0;
}
In addition, I would perhaps also advise executing method 1, then method 2, then method 1, then method 2 again, to ensure that the benchmark isn't biased by the first algorithm loading values into cache and the second reusing them, or by the eventual context switch that will happen some time after launching your program.
From there on you can start to measure and consider the performance of your actual algorithm, which is not what you are currently doing. I would suggest that you post a separate question about that.