Why is C-style array performance at -O3 worse than with no optimization on Quick Bench?

Based on C-style Arrays vs std::vector using std::vector::at, std::vector::operator[], and iterators, I ran the following benchmarks.

no optimization https://quick-bench.com/q/LjybujMGImpATTjbWePzcb6xyck
O3 https://quick-bench.com/q/u5hnSy90ZRgJ-CQ75b1c1a_3BuY

From here, vectors definitely perform better in O3.

However, the C-style array is slower with -O3 than with -O0:

C-style (no opt) : about 2500
C-style (O3) : about 3000

I don't know what factors lead to this result. Maybe it's because the compiler is using C++14?

(I'm not asking about std::vector relative to plain arrays, I'm just asking about plain arrays with/without optimization.)

CodePudding user response：

Your -O0 code wasn't faster in an absolute sense, just as a ratio against an empty
for (auto _ : state) {} loop.

That also gets slower when optimization is disabled, because the state iterator functions don't inline. Check the asm for your own functions, and instead of an outer-loop counter in %rbx like:

      # outer loop of your -O3 version
       sub    $0x1,%rbx
       jne    407f57 <BM_map_c_array(benchmark::State&)+0x37>

RBX was originally loaded from 0x10(%rdi), from the benchmark::State& state function arg.

You instead get state counter updates in memory, like the following, plus a bunch of convoluted code that materializes a boolean in a register and then tests it again.

# part of the outer loop of your -O0 version
12.50%   mov    -0x8060(%rbp),%rax
25.00%   sub    $0x1,%rax
12.50%   mov    %rax,-0x8060(%rbp)

There are high counts on those instructions because the call to map_c_array didn't inline, so most of the CPU time wasn't actually spent in this function itself. But of the time that was, about half was on these instructions. In an empty loop, or one that called an empty function (I'm not sure which Quick Bench is doing), that would still be the case.
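To make the memory-versus-register difference concrete, here is a minimal sketch, not the Quick Bench code. The function names are made up for illustration, and using volatile is only an assumption about how to imitate what -O0 does: it forces a load and a store of the counter on every iteration, much like the mov / sub / mov sequence above.

```cpp
#include <cstdint>

// Hypothetical helper: the counter lives in memory. `volatile` forces a
// load and a store on every iteration, roughly imitating the -O0 code
// that keeps the state counter on the stack.
std::uint64_t count_down_in_memory(std::uint64_t n) {
    volatile std::uint64_t counter = n;
    std::uint64_t iterations = 0;
    while (counter != 0) {
        counter = counter - 1;  // load, subtract, store
        ++iterations;
    }
    return iterations;
}

// Hypothetical helper: an ordinary local counter. With optimization
// enabled the compiler is free to keep it in a register, or even to
// fold the whole loop into `return n`.
std::uint64_t count_down_in_register(std::uint64_t n) {
    std::uint64_t counter = n;
    std::uint64_t iterations = 0;
    while (counter != 0) {
        --counter;
        ++iterations;
    }
    return iterations;
}
```

Both functions return the same count; only the cost per iteration differs. Comparing the asm for the two (for example on Compiler Explorer, with and without optimization) should show the counter staying in memory in the first and in a register in the second when optimization is on.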

Quick Bench does this to try to normalize things for whatever hardware its cloud VM ends up running on, with whatever competing load. Click "About Quick Bench" in the dropdown at the top right for details.

And see the label on the graph: CPU time / Noop time. (When they say "Noop", they don't mean a nop machine instruction, they mean in a C sense.)

An empty loop with a loop counter runs about 6x slower when compiled with optimization disabled (bottlenecked on store-to-load forwarding latency of the loop counter), so in absolute terms your -O0 code is "only" a bit less than 6x slower than the -O3 build, not exactly 6x slower.
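As a rough worked example of that arithmetic, the sketch below converts two Quick Bench scores back into an absolute slowdown. The function name is hypothetical, the scores are the approximate ones from the question, and the 6x noop slowdown is the estimate given above, not a measured value.

```cpp
// Quick Bench plots score = CPU time / Noop time, so absolute time is
// proportional to score * noop_time. If the noop loop itself is
// `noop_slowdown` times slower in the unoptimized build, the absolute
// slowdown of the unoptimized code relative to the optimized code is:
//   (score_unopt * noop_slowdown) / score_opt
double absolute_slowdown(double score_unopt, double score_opt,
                         double noop_slowdown) {
    return (score_unopt * noop_slowdown) / score_opt;
}

// Example with the numbers from this thread (all approximate):
//   absolute_slowdown(2500.0, 3000.0, 6.0) gives 5.0,
// i.e. the -O0 build is about 5x slower in absolute terms, even though
// its normalized score looks lower than the -O3 score.
```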

With a counter in a register, modern x86 CPUs can run loops at 1 cycle per iteration, like looptop: dec