I have a simple code that sums elements from an array and returns them:
```c
// Called with jump == 0
int performance(int jump, int *array, int size) {
    int currentIndex = 0;
    int total = 0;
    // For i in 1...500_000_000
    for (int i = 0; i < 500000000; i++) {
        currentIndex = (currentIndex + jump) % size;
        total += array[currentIndex];
    }
    return total;
}
```
I noticed a weird behavior: the presence of `% size` has a very large performance impact (~10x slower) even though `jump` is 0, so it is constantly accessing the same array element (0). Just removing `% size` improves performance a lot.
I would have thought this was just the modulo computation that was making this difference, but now say I replace my sum line with total = array[currentIndex] % size;
(thus also computing a modulo) the performance difference is almost unnoticeable.
I am compiling this with -O3 with clang on an arm64 machine.
What could be causing this?
Answer:
Sounds normal for `sdiv` + `msub` latency to be about 10x `add` latency.
Even if this inlined for a compile-time-constant `size` that wasn't a power of two, that's still a multiplicative inverse and an `msub` (multiply-subtract) to get the remainder, so a dep chain of at least two multiplies and a shift. There may be an extra few instructions on the critical path for a signed remainder with a constant size (even if positive), since the index math is also signed `int`: e.g. `-4 % 3` has to produce `-1` in C.
See
- How many CPU cycles are needed for each assembly instruction?
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
> say I replace my sum line with `total += array[currentIndex] % size;` (thus also computing a modulo)
That remainder isn't part of a loop-carried dependency chain (see https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/). Multiple remainder calculations can be in flight in parallel, since the next `array[idx]` load address only depends on an `idx += jump` `add` instruction.
If you don't bottleneck on throughput limits, those remainder results could potentially be ready with 1/clock throughput, with OoO exec overlapping dep chains between iterations. The only latency bottlenecks are the loop counter/index and `total += ...`, both of which are just integer `add` with 1 cycle latency.

So really, the bottleneck is likely going to be throughput (of the whole loop body), not those latency bottlenecks, unless you're testing on an extremely wide CPU that can get a lot done every cycle. (I'm surprised you don't get more slowdown from introducing the `%` at all. Unless `total` is getting optimized away after inlining because you're not using the result.)