I have a simple code that sums elements from an array and returns them:
```c
// Called with jump == 0
int performance(int jump, int *array, int size) {
    int currentIndex = 0;
    int total = 0;
    // For i in 1...500_000_000
    for (int i = 0; i < 500000000; i++) {
        currentIndex = (currentIndex + jump) % size;
        total += array[currentIndex];
    }
    return total;
}
```
I noticed a weird behavior: the presence of `% size` has a very large performance impact (~10x slower) even though `jump` is 0, so it is constantly accessing the same array element (0). Just removing `% size` improves performance a lot.
I would have thought this was just the modulo computation that was making this difference, but now say I replace my sum line with total = array[currentIndex] % size;
(thus also computing a modulo) the performance difference is almost unnoticeable.
I am compiling this with -O3 with clang on an arm64 machine.
What could be causing this?
Answer:
Sounds normal for `sdiv` + `msub` latency to be about 10x `add` latency.
Even if this inlined for a compile-time-constant `size` that wasn't a power of two, that's still a multiplicative inverse and an `msub` (multiply-subtract) to get the remainder, so a dep chain of at least two multiplies and a shift. There may be an extra few instructions on the critical path for a signed remainder with a constant size (even if positive), since the index math is also signed `int`: e.g. `-4 % 3` has to produce `-1` in C.
See
- How many CPU cycles are needed for each assembly instruction?
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
> say I replace my sum line with `total += array[currentIndex] % size;` (thus also computing a modulo)
That remainder isn't part of a loop-carried dependency chain (see https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/). Multiple remainder calculations can be in flight in parallel, since the next `array[idx]` load address only depends on an `idx += jump` `add` instruction.
If you don't bottleneck on throughput limits, those remainder results could potentially be ready with 1/clock throughput, with OoO exec overlapping dep chains between iterations. The only latency bottlenecks are the loop counter/index and `total += ...`, both of which are just integer `add` with 1 cycle latency.

So really, the bottleneck is likely going to be throughput (of the whole loop body), not those latency bottlenecks, unless you're testing on an extremely wide CPU that can get a lot done every cycle. (I'm surprised you don't get more slowdown from introducing the `%` at all. Unless `total` is getting optimized away after inlining because you're not using the result.)