The line of code
next = val;
slows performance down by about 10x. I have checked the generated assembly; the computed result is unchanged.
Why does this one line cause a 10x slowdown?
Here is the result:
➜ ~ clang-13 1.c -O3
➜ ~ ./a.out
rand_read_1
sum = 2624b18779c40, time = 0.19s
rand_read_2
sum = 2624b18779c40, time = 1.24s
CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Here is the code:
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>
#include <stdlib.h>
#define CCR_MULTIPLY_64 6364136223846793005
#define CCR_ADD_64 1
static inline uint64_t my_rand64(uint64_t *r)
{
*r = *r * CCR_MULTIPLY_64 + CCR_ADD_64;
return *r;
}
#define NUM 10000000UL
uint64_t rand_read_1(uint64_t *ptr, uint64_t nr_words)
{
uint64_t i, next, val = 0;
uint64_t sum;
next = 0;
sum = 0;
for (i = 0; i < NUM; i++) {
my_rand64(&next);
next %= nr_words;
val = ptr[next];
sum += val ^ next;
// printf("next1:%ld\n", next);
}
return sum;
}
uint64_t rand_read_2(uint64_t *ptr, uint64_t nr_words)
{
uint64_t i, next, val, next2 = 0;
uint64_t sum;
next = 0;
sum = 0;
for (i = 0; i < NUM; i++) {
my_rand64(&next);
next %= nr_words;
val = ptr[next];
sum += val ^ next;
next = val;
}
return sum;
}
#define SIZE (1024*1024*1024)
static uint64_t get_ns(void)
{
struct timespec val;
uint64_t v;
int ret;
ret = clock_gettime(CLOCK_REALTIME, &val);
if (ret != 0) {
perror("clock_gettime");
exit(1);
}
v = (uint64_t) val.tv_sec * 1000000000LL;
v += (uint64_t) val.tv_nsec;
return v;
}
int main(int argc, char *argv[])
{
uint64_t *ptr;
uint64_t sum;
uint64_t t0, t1, td, t2;
ptr = (uint64_t *)malloc(SIZE);
assert(ptr);
memset(ptr, 0, SIZE);
t0 = get_ns();
printf("rand_read_1\n");
sum = rand_read_1(ptr, SIZE/8);
t1 = get_ns();
td = t1 - t0;
printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
printf("rand_read_2\n");
sum = rand_read_2(ptr, SIZE/8);
t2 = get_ns();
td = t2 - t1;
printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
return 0;
}
CodePudding user response:
The method of benchmarking is a bit dodgy, but this is a real effect.
next = val;
changes something fundamental about the structure of the code: it makes each memory read depend on the result of the previous read. Without that line, the reads are independent (there is a shorter loop-carried dependency chain through my_rand64, which the memory read is not a part of).
Essentially with that line it's a latency benchmark, and without that line, it's a throughput benchmark. Latency and throughput differing by a factor of 10 is reasonable for memory reads.
At the assembly level, without that line, the asm looks like this when compiled with Clang:
.LBB2_3: # =>This Inner Loop Header: Depth=1
imul rcx, r15
add rcx, 1
mov edx, ecx
and edx, 134217727
xor rdx, qword ptr [r14 + 8*rdx]
mov esi, r15d
imul esi, ecx
add rdx, rbx
add esi, 1
and esi, 134217727
mov rbx, qword ptr [r14 + 8*rsi]
xor rbx, rsi
add rbx, rdx
mov rcx, rsi
add rax, -2
jne .LBB2_3
uiCA estimates 9.16 cycles per iteration (the loop was unrolled by a factor of 2, so this corresponds to about 4.5 cycles per iteration of the original loop), but it does not take cache misses into account.
With that line, the assembly looks nearly the same, but that doesn't mean it runs in nearly the same way:
.LBB2_6: # =>This Inner Loop Header: Depth=1
imul ecx, r15d
add ecx, 1
and ecx, 134217727
mov rdx, qword ptr [r14 + 8*rcx]
mov rsi, rcx
xor rsi, rdx
add rsi, rbx
add edx, ecx
imul edx, r15d
add edx, 1
and edx, 134217727
mov rcx, qword ptr [r14 + 8*rdx]
mov rbx, rdx
xor rbx, rcx
add rbx, rsi
add rdx, rcx
mov rcx, rdx
add rax, -2
jne .LBB2_6
Now uiCA estimates 24.11 cycles per iteration (this loop was also unrolled by 2x), again without taking cache misses into account.
CodePudding user response:
Some notes regarding how not to do benchmarking (benchmarking is very hard):

malloc is very expensive and you allocate a lot of memory here. However, the OS may not actually allocate it before it notices that the memory is being used. So unless you force the OS to actually allocate the memory before the benchmarking starts, you'll be benchmarking how slow malloc is and nothing else, which is a gigantic performance killer next to your little algorithm. Additionally, you may allocate too much memory for your OS to handle, and you also do not necessarily allocate an aligned multiple of sizeof(uint64_t), which is a bug. You could do something like this instead (might have to reduce SIZE first):

ptr = malloc(SIZE * sizeof(uint64_t));
assert(ptr);
for (size_t i = 0; i < SIZE; i++) {
    volatile uint64_t tmp = 123; // some garbage value
    ptr[i] = tmp;                // the compiler has to load 123 from stack to heap
}
// actual heap allocation should now be done

The memset to zero probably does not count as "touching" the heap memory, and the compiler might even optimize it out by swapping to calloc. When I check the optimized code (gcc x86_64 Linux), this is indeed what happens. So it does not achieve the "touching the heap" thing described above.

printf and the stdout buffers are performance-heavy and should never be placed inside the benchmarked code. You might just end up benchmarking how slow printf is. For example, changing your code to

printf("rand_read_1\n");
t0 = get_ns();
sum = rand_read_1(ptr, SIZE/8);
t1 = get_ns();

gave vastly different results.

The last
td = t2 - t1;
is nonsense, since it measures everything unrelated to your algorithm that has happened since the previous measurement, including any number of context switches by the OS.
With all these bug fixes applied, main() might look like this instead:
int main(int argc, char *argv[])
{
uint64_t *ptr;
uint64_t sum;
uint64_t t0, t1, td, t2;
ptr = malloc(SIZE * sizeof(uint64_t));
assert(ptr);
for(size_t i=0; i<SIZE; i++)
{
volatile uint64_t tmp = 123; // some garbage value
ptr[i] = tmp; // the compiler has to load 123 from stack to heap
}
printf("rand_read_1\n");
t0 = get_ns();
sum = rand_read_1(ptr, SIZE/8);
t1 = get_ns();
td = t1 - t0;
printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
printf("rand_read_2\n");
t0 = get_ns();
sum = rand_read_2(ptr, SIZE/8);
t1 = get_ns();
td = t1 - t0;
printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
return 0;
}
In addition, I would perhaps also advise executing method 1, then method 2, then method 1, then method 2 again, to ensure that the benchmark isn't biased by the first algorithm loading values into cache and the second reusing them, or by the eventual context switch that will happen some time after launching your program.
From there on you can start to measure and consider the performance of your actual algorithm, which is not what you are currently doing. I would suggest that you post a separate question about that.