Is there a reason for compiling an indirect jump as two instructions instead of one on ARM?


Given the following tiny program:

#include <stdlib.h>

#define NEXT goto **ip++
#define guard(n) asm("#" #n)

int main() {
  static void  *prog[] = {&&next1,&&next2,&&next1,&&next3,&&next1,&&next4,&&next1,&&next5,&&next1,&&loop};
  void ** ip=prog;
  int    count = 100000000;
  NEXT;

 next1: guard(1); NEXT;
 next2: guard(2); NEXT;
 next3: guard(3); NEXT;
 next4: guard(4); NEXT;
 next5: guard(5); NEXT;
 loop:
  if (count) {
    count--;
    ip=prog;
    NEXT;
  }
  exit(0);
}

I noticed that each of the next# statements gets compiled as TWO instructions:

        ldr     r2, [r3], #4
        mov     pc, r2    @ indirect register jump
 

I would have expected this to only need one instruction:

        ldr pc, [r3], #4

I found the discussion here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40887

"The problem is that the instruction "ldr pc, [r3, #0]" is not considered a function call by the Cortex-A8's branch predictor, as noted in DDI0344J section 5.2.1, Return stack predictions. Thus, the return from the called function is mispredicted resulting in a penalty of 13 cycles compared to a direct call."

However, the 'goto' is not a function call, so there's no reason to expect the return stack to be relevant here at all.

I'm wondering whether this is an optimization that both GCC and Clang have missed, or whether the single-instruction form performs worse for some reason I'm unaware of.

CodePudding user response:

This looks like a missed optimization, unless there is some other microarchitectural reason to avoid it on some other CPUs. (That's plausible, but I wouldn't specifically expect it. Loading into a register multiple instructions earlier could help hide load-use latency and reduce any mispredict penalty, but loading in the immediately preceding instruction is unlikely to matter unless there's something special about ldr into the PC.)
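
To illustrate the "load earlier" idea, here is a minimal sketch of the classic prefetch-the-next-target threading trick. This is my own toy example (the names NEXT_EARLY, next, op1, etc. are made up, not the question's code or anything GCC emits), and whether it actually helps depends on the core:

#include <stdlib.h>

/* Sketch only: the target for each dispatch is loaded one handler ahead of
 * the indirect branch that consumes it, so the handler body's work can
 * overlap the load latency. */
#define NEXT_EARLY do { void *t = next; next = *ip++; goto *t; } while (0)

int main(void) {
  /* the extra trailing &&done keeps the one-ahead load in bounds */
  static void *prog[] = { &&op1, &&op2, &&done, &&done };
  void **ip  = prog;
  void *next = *ip++;          /* prime: first target loaded up front */
  int acc = 0;
  NEXT_EARLY;

 op1: acc += 1; NEXT_EARLY;
 op2: acc += 2; NEXT_EARLY;
 done:
  exit(acc == 3 ? 0 : 1);
}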

You're correct, bug #40887 is only about indirect calls with blx vs. manually setting up a return address and jumping. It's not relevant to indirect jumps inside a function, like for a switch or computed goto. (Except perhaps if GCC is avoiding loads into PC in general, so this missed optimization is collateral damage from fixing that bug.)
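
To make the call-vs-jump distinction concrete, here's a minimal sketch (my example, not code from the bug report):

/* Hypothetical illustration: an indirect *call* has a matching return, and
 * that return address is what the Cortex-A8's return stack tries to
 * predict.  Bug #40887 is about making GCC emit blx for this rather than a
 * load into pc that the predictor doesn't recognize as a call. */
void call_through_pointer(void (*f)(void)) {
  f();   /* indirect call: the 13-cycle penalty described in the bug is on
            the mispredicted return from f */
}
/* A computed goto like the question's never returns to a saved address, so
   there is no return for the return stack to predict. */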

And you're not using volatile, another thing that often makes GCC do a load as a separate instruction instead of folding it into something else (the way it avoids x86 add eax, [rdi] for volatile loads; in this case the fold would be a memory-source jump, i.e. an ARM load into the PC, which GCC might treat as special).
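
A rough sketch of that effect (my own example; exact code generation varies by target and compiler version):

/* With a plain pointer the compiler is free to fold the load into the
 * consuming instruction (e.g. x86 add eax, [rdi]).  With volatile it
 * typically emits the load as its own instruction. */
int add_plain(int *p, int x)             { return x + *p; }
int add_volatile(volatile int *p, int x) { return x + *p; }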


Comparing -mcpu=cortex-a8 -marm vs. -mthumb, we see that GCC does need extra instructions in Thumb mode to set the low bit of the target address before the mov pc, reg: https://godbolt.org/z/87EszvWP1

Or maybe that's a missed optimization, too: just ldr pc, [mem] would stay in the current mode, and we know we're jumping within a single function so there's no possibility of changing mode. And/or the jump table could just have been built with the low bits already set if using bx r2 is actually faster.

https://developer.arm.com/documentation/dui0473/m/arm-and-thumb-instructions/ldr--register-offset- says

For word loads, Rt can be the PC. A load to the PC causes a branch to the address loaded. In ARMv4, bits[1:0] of the address loaded must be 0b00. In ARMv5T and above, bits[1:0] must not be 0b10, and if bit[0] is 1, execution continues in Thumb state, otherwise execution continues in ARM state.

In Thumb mode, ldr into PC is only possible with a 32-bit instruction, but ldr into r0-7 and branching to it can each be 16-bit instructions. But I doubt that would be any better unless you can schedule the load earlier.
