I wrote a toy program that compares the performance of two very similar functions. The whole file (minus a couple of macros) looks like this:
#include <cstdint>

constexpr int iterations = 100000;
using u64 = uint64_t;

// Slow.
void test1()
{
    u64 sum = 0;
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < 4; j++)
            sum += i + j;
        doNotOptimize(sum);
    }
}

// Fast.
void test2()
{
    u64 sum = 0;
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < 10; j++)
            sum += i + j;
        doNotOptimize(sum);
    }
}

void driver()
{
    int runs = 1000;
    BENCH(runs, test1);
    BENCH(runs, test2);
}
I am measuring 1000 executions of each function using __rdtsc and computing the average. With this formula, I am seeing a performance difference of ~172,000 (ticks?) between test1 and test2. What's surprising is that test2 is the one that's faster. An exotic little detail is that the only magic numbers for which test1 is slower are 4, 8, and 16. If I change the inner loop's condition to j < x where x is anything but those 3 numbers, the performance matches up.
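In case it's relevant, the two macros I omitted look roughly like this (doNotOptimize is the exact inline asm I use; BENCH is a simplified sketch, the real one only differs in details):

#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

// Force the variable to be stored to (and later reloaded from) memory.
#define doNotOptimize(var) asm volatile("" :: "g"(&(var)) : "memory")

// Average __rdtsc ticks per call over `runs` calls (simplified sketch).
#define BENCH(runs, fn)                                          \
    do {                                                         \
        uint64_t start_ = __rdtsc();                             \
        for (int r_ = 0; r_ < (runs); ++r_)                      \
            fn();                                                \
        uint64_t avg_ = (__rdtsc() - start_) / (runs);           \
        printf(#fn ": %llu ticks\n", (unsigned long long)avg_);  \
    } while (0)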
Analyzing the assembly, I observe that the inner loops in both functions are eliminated and replaced by a few arithmetic operations folded into the operands of lea. So in this case, it would make sense for both functions to run at the same speed. But that's not at all what's happening. Here's a Compiler Explorer disassembly and the program source in its entirety: https://godbolt.org/z/d5PsG4YeY
So what's really going on? Is something wrong with my measurements?
Execution environment:
Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (Kaby Lake)
L1 Cache: 128 KiB
OS: Linux (64-bit)
Compiler: GCC 10.3.0
Compiler Options: -O3 -fno-tree-vectorize
Thank you!
CodePudding user response:
4 and 8 are scale factors that x86 addressing modes support, tempting GCC into using a slow-LEA on the critical path dependency chain when adding 4*i or 8*i to sum, along with the constant sum of 0..3 or 0..7 (either way just a constant, irrelevant what it is). Apparently the slow-LEA is also part of the dep chain for 16. And you used inline asm to force the sum dep chain to include a store/reload.
Analyzing the assembly, I observe that the inner loops in both functions are eliminated and replaced by a few arithmetic operations folded into the operands of lea. So in this case, it would make sense for both functions to run at the same speed.
They're both fast, but the different multiples of i take different instructions. So there's little reason to assume they'd be the same. The interesting thing here is that the one with more total instructions is faster, because it has a shorter dependency chain through sum.
And you forced a store/reload of it with the somewhat-clunky asm volatile("" :: "g"(&sum) : "memory"), instead of just requiring the compiler to have the value in a register with asm volatile("" : "+r"(sum)). So that dep chain includes store-forwarding latency (typically 3 to 5 cycles), making it a bigger bottleneck than front-end or ALU throughput of the independent work.
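In code, the two barrier styles look like this (a standalone sketch; the function name and loop body are just for illustration):

#include <cstdint>

void demo()
{
    uint64_t sum = 0;
    for (int i = 0; i < 100000; i++) {
        sum += i;
        // The question's barrier: publishes &sum with a "memory" clobber,
        // so GCC must store sum before the asm and reload it afterwards.
        // Store-forwarding latency joins the loop-carried dep chain.
        asm volatile("" :: "g"(&sum) : "memory");
        // Lighter alternative: "+r" only requires the current value to be
        // materialized in a register; no store/reload is forced.
        // asm volatile("" : "+r"(sum));
    }
}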
test1():
xor eax, eax # i = 0
xor ecx, ecx # sum = 0
lea rsi, [rsp-8] # operand for inline asm
jmp .L3
.L5:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx # i = i + 1
.L3:
lea rdx, [rax+1] # i + 1, leaving the old value in RAX
lea rax, [rcx+6+rax*4] # sum += 4*i + 6. (6 = 1+2+3)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L5
ret
An LEA with 3 components (two + operations) in the addressing mode has 3 cycle latency on your Skylake-derived CPU. See x86 Multiplication with 3: IMUL vs SHL + ADD.
So the loop-carried dependency chain is a slow LEA (3 cycles) between the load and the store.
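As a quick illustration of the distinction (Intel syntax; the 1-cycle figures for simple LEAs are the usual numbers for Skylake-family cores, not something taken from the question):

lea rax, [rcx+8]       # 2 components, one addition: 1 cycle latency
lea rax, [rcx+rdx*4]   # base + scaled index: still 1 cycle
lea rax, [rcx+6+rdx*4] # 3 components, two additions: 3 cycle latency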
test2():
xor eax, eax
xor ecx, ecx
lea rdi, [rsp-8] # operand for inline asm
jmp .L8
.L9:
mov rcx, QWORD PTR [rsp-8] # load sum
mov rax, rdx
.L8:
lea rsi, [rax+36+rax*8]
lea rdx, [rax+1]
lea rax, [rax+9+rsi] # prepare some stuff to be added
add rax, rcx # first use of the RCX load result (sum)
mov QWORD PTR [rsp-8], rax # store sum
cmp rdx, 100000
jne .L9
ret
So the loop-carried dep chain through the store/reload only includes an add, 1 cycle latency instead of 3.
I assume your performance ratios were something like 3:4 or so, from the 5+1 cycles (6) vs. 5+3 cycles (8), taking the store-forwarding latency as about 5 cycles.
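Spelled out per iteration of the loop-carried chain (assuming ~5 cycles for the store-forwarding step):

test2: 5 (store-forward) + 1 (add)      = 6 cycles
test1: 5 (store-forward) + 3 (slow LEA) = 8 cycles
6 : 8 = 3 : 4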
See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for more details.
The compiler could have spent more instructions in test1 to reduce the critical path latency to 1 cycle there, instead of folding more work into an lea on the critical path. For a loop running 100k iterations, this is pretty much a missed optimization. But I assume the optimizer isn't tuned for artificially-induced spill/reload from inline asm; normally it would only have to introduce that store/reload if it ran out of registers while doing a lot of work in a loop, and that many different values would usually mean there was some instruction-level parallelism.
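For example, a hand-written sketch (not actual GCC output) of a test1 loop that strength-reduces 4*i + 6 into a separately incremented register, so only a 1-cycle add stays on the sum dep chain:

# Hypothetical codegen: RDX tracks 4*i + 6, RAX tracks i.
xor eax, eax # i = 0
mov edx, 6 # 4*0 + 6
.Lloop:
mov rcx, QWORD PTR [rsp-8] # load sum
add rcx, rdx # only op on the sum dep chain: 1 cycle
mov QWORD PTR [rsp-8], rcx # store sum
add rdx, 4 # next 4*i + 6, independent of sum
add rax, 1 # i++
cmp rax, 100000
jne .Lloop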
GCC makes better asm from simpler source on Quick-bench
@TedLyngmo linked the code on Quick-bench without -fno-tree-vectorize, and using benchmark::DoNotOptimize(sum); which only forces GCC to materialize the value in a register, without blocking constant-propagation through it or as many other optimizations, unlike taking its address and telling GCC the value was potentially globally visible, as the current custom asm does.
The inner loop body is just add %rdx,%rax / add $0x4,%rdx (plus cmp/jne as the loop branch), if you look at the asm on Quick-bench. Or with rdx+=10 for the other loop. So: same loops, different constants. Same performance, of course.
The current source is essentially compiling to asm that does

for (int i = 0; i < iterations; i++) {
    sum += 4*i + 1234;
    force_store_reload(&sum);
}
But if you actually write it that way (https://godbolt.org/z/4ja38or9j), we get asm like the Quick-bench version, except for keeping the value in memory instead of a register. (So about 6x slower.)
.L6:
add QWORD PTR [rsp-8], rax # memory destination add
add rax, 4
cmp rax, 401234
jne .L6
It seems to be a missed-optimization bug that GCC doesn't compile your existing source to that. Specifically, it's missing the strength-reduction from 4*i re-evaluated for every i into tmp += 4.
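In source terms, that's the classic induction-variable rewrite; a minimal sketch (hypothetical function name, same constants as the loop above):

#include <cstdint>

constexpr int iterations = 100000;

uint64_t strength_reduced()
{
    uint64_t sum = 0;
    uint64_t tmp = 1234;                 // value of 4*0 + 1234
    for (int i = 0; i < iterations; i++) {
        sum += tmp;                      // same as sum += 4*i + 1234;
        tmp += 4;                        // advance to 4*(i+1) + 1234
    }
    return sum;
}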
BTW, it looks like omitting -fno-tree-vectorize makes your original test1() compile even worse. It starts out without jumping into the middle of the loop, but it has a longer dep chain.
#gcc 10.3 -O3 (without -fno-tree-vectorize)
test1_orig():
mov QWORD PTR [rsp-8], 6 # init sum
lea rsi, [rsp-8] # operand for inline asm
mov eax, 1
.L2:
mov rdx, QWORD PTR [rsp-8] # load sum
mov rcx, rax
add rdx, rax # used: 1 cycle latency
add rax, 1
add rdx, rax # used: 1 cycle latency
lea rdx, [rdx+5+rcx*2] # used: 3 cycle latency
mov QWORD PTR [rsp-8], rdx # store
cmp rax, 100000
jne .L2
ret