Inline assembly array sum benchmark: near-zero time for large arrays with optimization enabled

Time:05-18

I have written two functions that compute the sum of an array: one in C++ and one using inline assembly (x86-64). I compared the performance of the two functions on my machine.

  • If no -O flag is enabled during compilation, the inline-assembly version is almost 4-5x faster than the C++ version.

    cpp time : 543070068 nanoseconds
    cpp time : 547990578 nanoseconds
    
    asm time : 185495494 nanoseconds
    asm time : 188597476 nanoseconds
    
  • If the -O flag is set to -O1, they produce about the same performance.

    cpp time : 177510914 nanoseconds
    cpp time : 178084988 nanoseconds
    
    asm time : 179036546 nanoseconds
    asm time : 181641378 nanoseconds
    
  • But if I set the -O flag to -O2 or -O3, I get an unusual 2-3 digit nanosecond reading for the inline-assembly function, which seems suspiciously fast (at least to me; please bear with me, since I have no solid experience with assembly programming, so I don't know how fast or slow it can be compared to a program written in C++).

    cpp time : 177522894 nanoseconds
    cpp time : 183816275 nanoseconds
    
    asm time : 125 nanoseconds
    asm time : 75 nanoseconds
    

My Questions

  • Why is this array sum function written with inline assembly so fast after enabling -O2 or -O3?

  • Is this a normal reading, or is there something wrong with the timing/measurement of the performance?

  • Or maybe there is something wrong with my inline assembly function?

  • And if the inline assembly function for the array sum is correct and the performance reading is correct, why did the compiler fail to optimize the simple C++ array sum function and make it as fast as the inline assembly version?

I have also speculated that maybe memory alignment and cache misses are somehow improved during compilation, increasing the performance, but my knowledge on this is still very limited.

Apart from answering my questions, if you have something to add, please feel free to do so. I hope somebody can explain, thanks!


[EDIT]

So I have removed the use of the macro, run the two versions in isolation, and also tried adding the volatile keyword, a "memory" clobber, and a "+&r" constraint for the output, and the performance is now the same as cpp_sum.

Though if I remove the volatile keyword and the "memory" clobber again, I'm still getting those 2-3 digit nanosecond readings.

code:

#include <iostream>
#include <random>
#include <chrono>

uint64_t sum_cpp(const uint64_t *numbers, size_t length) {
    uint64_t sum = 0;
    for(size_t i=0; i<length; ++i) {
        sum += numbers[i];
    }
    return sum;
}

uint64_t sum_asm(const uint64_t *numbers, size_t length) {
    uint64_t sum = 0;
    asm volatile(
        "xorq %%rax, %%rax\n\t"
        "%=:\n\t"
        "addq (%[numbers], %%rax, 8), %[sum]\n\t"
        "incq %%rax\n\t"
        "cmpq %%rax, %[length]\n\t"
        "jne %=b"
        : [sum]"+&r"(sum)
        : [numbers]"r"(numbers), [length]"r"(length)
        : "%rax", "memory", "cc"
    );
    return sum;
}

int main() {
    std::mt19937_64 rand_engine(1);
    std::uniform_int_distribution<uint64_t> random_number(0,5000);

    size_t length = 99999999;
    uint64_t *arr = new uint64_t[length];
    for(size_t i=0; i<length; ++i) arr[i] = random_number(rand_engine);

    uint64_t cpp_total = 0, asm_total = 0;

    for(size_t i=0; i<5; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
#ifndef _INLINE_ASM
        cpp_total += sum_cpp(arr, length);
#else
        asm_total += sum_asm(arr,length);
#endif
        auto end = std::chrono::high_resolution_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start);
        std::cout << "time : " << dur.count() << " nanoseconds\n";
    }

#ifndef _INLINE_ASM
    std::cout << "cpp sum = " << cpp_total << "\n";
#else
    std::cout << "asm sum = " << asm_total << "\n";
#endif

    delete [] arr;
    return 0;
}

CodePudding user response:

Without being asm volatile, your asm statement is a pure function of the inputs you've told the compiler about: a pointer, a length, and the initial value sum=0. Those inputs do not include the pointed-to memory, because you didn't use a dummy "m" input for that. (See How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)

Without a "memory" clobber, your asm statement isn't ordered wrt. function calls, so GCC is hoisting the asm statement out of the loop. See How does Google's `DoNotOptimize()` function enforce statement ordering for more details about that effect of the "memory" clobber.

Have a look at the compiler output on https://godbolt.org/z/KeEMfoMvo and see how it inlined into main. -O2 and higher enables -finline-functions, while -O1 only enables -finline-functions-called-once; since the function isn't static or inline, GCC still has to emit a stand-alone definition in case of calls from other compilation units.

75 ns is just the timing overhead of the std::chrono calls around a nearly-empty timed region. The sum is actually being computed, just not inside the timed region. You can see this if you single-step the asm of your whole program, or for example set a breakpoint on the asm statement. When doing asm-level debugging of the executable, you could help yourself find it by putting a funky instruction like mov $0xdeadbeef, %eax in the template, something you can search for in the disassembly.
