While experimenting with the 64-bit ARM architecture, I noticed a peculiar speed difference depending on whether br or ret is used to return from a subroutine.
; Contrived for learning/experimenting purposes only, without any practical use
foo:
    cmp w0, #0
    b.eq .L0
    sub w0, w0, #1
    sub x30, x30, #4
    ret
.L0:
    ret ; Intentionally duplicated 'ret'
The intent of this subroutine is to make the caller of foo "reenter" foo w0 times by making foo return to the instruction that called foo in the first place (i.e. the instruction immediately before the one to which x30 points). In some rough timing, with w0 set to a sufficiently high value, it took about 1362 milliseconds on average. Curiously, replacing the first ret with br x30 makes it run over twice as fast, taking only 550 milliseconds or so on average.
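To make the intended control flow concrete, here is a minimal Python sketch that traces the bl/ret ping-pong. It is only a model: the address 0x1000 for the caller's bl foo instruction and the name count_foo_entries are hypothetical, chosen for illustration, and nothing here measures timing.

```python
def count_foo_entries(w0, call_addr=0x1000):
    """Trace the caller's `bl foo` and foo's return; count entries into foo.

    call_addr is a made-up address for the caller's `bl foo` instruction.
    Returns (number of times foo was entered, final program counter).
    """
    pc = call_addr
    entries = 0
    while pc == call_addr:
        # bl foo: the link register gets the address of the next instruction
        x30 = pc + 4
        entries += 1
        if w0 == 0:
            # .L0: plain ret, x30 untouched, so we leave the loop
            pc = x30
        else:
            w0 -= 1      # sub w0, w0, #1
            x30 -= 4     # sub x30, x30, #4: point back at the bl itself
            pc = x30     # ret (or br x30) lands on the bl again
    return entries, pc


entries, final_pc = count_foo_entries(3)
print(entries, hex(final_pc))  # 4 0x1004
```

With w0 = 3, foo is entered four times: three re-entries plus the final call that returns normally to the instruction after the bl.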
The timing discrepancy goes away if the test is simplified to just repeatedly calling a subroutine with a bare ret/br x30. What makes the above contrived subroutine slower with a ret?
I tested this on some kind of ARMv8.2 processor (Cortex-A76 + Cortex-A55 cores). I'm not sure to what extent big.LITTLE would affect the timings, but they seemed pretty consistent over multiple runs. This is by no means a real [micro]benchmark, just a rough "how long does this take if run N times" measurement.
CodePudding user response:
Most modern microarchitectures have a special predictor for call / return pairs, because calls and returns tend to match up with each other in real programs. (And predicting returns any other way is hard for functions with many call sites: a return is an indirect branch.)
By playing with the return address manually, you're making those return predictions wrong. So every ret causes a branch mispredict, except the one where you didn't play with x30.
But if you use an indirect branch other than the one recognized specifically as a ret idiom, e.g. br x30, the CPU uses its standard indirect-branch prediction method, which does well when the br goes to the same location repeatedly.
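The difference between the two predictors can be illustrated with a toy Python model. This is only a sketch under loud assumptions: a return-address stack of hypothetical depth 16, a one-entry "last target" predictor standing in for the general indirect-branch predictor, and made-up addresses. It does not describe how a Cortex-A76 or any real core is actually built; it only counts mispredictions in this simplified model.

```python
def simulate(n, use_ret_for_reentry, ras_depth=16):
    """Count mispredicted returns from foo in a toy predictor model.

    n: initial w0. use_ret_for_reentry: True models `ret` on the re-entry
    path, False models `br x30`. ras_depth is an assumed stack depth.
    """
    CALL, AFTER = 0x1000, 0x1004   # hypothetical addresses: bl foo, next insn
    ras = []                       # return-address stack, pushed by bl
    last_target = None             # toy indirect predictor state for br x30
    mispredicts = 0
    w0 = n
    while True:
        # bl foo: push the expected return address
        ras.append(AFTER)
        if len(ras) > ras_depth:
            ras.pop(0)             # oldest entry falls off a full stack
        x30 = AFTER
        if w0 == 0:
            # Final, unmodified ret: predicted by popping the stack
            predicted = ras.pop() if ras else None
            mispredicts += predicted != x30
            return mispredicts
        w0 -= 1
        x30 -= 4                   # actual target is now the bl itself
        if use_ret_for_reentry:
            predicted = ras.pop() if ras else None   # predicts AFTER: wrong
        else:
            predicted = last_target                  # repeats last target
            last_target = x30
        mispredicts += predicted != x30


print(simulate(1000, True))    # 1000
print(simulate(1000, False))   # 1
```

In this model the ret version mispredicts on every re-entry and only the final ret is predicted correctly, while the br x30 version mispredicts once (the first time, before the toy predictor has seen a target).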
A quick Google search found some info from ARM for Cortex-R4 about the return-predictor stack on that microarchitecture for 32-bit mode (a 4-entry circular buffer): https://developer.arm.com/documentation/ddi0363/e/prefetch-unit/return-stack
For x86, https://blog.stuffedcow.net/2018/04/ras-microbenchmarks/ is a good article about the concept in general, as well as some details on how various x86 microarchitectures maintain their prediction accuracy in the face of things like mis-speculated execution of a call or ret instruction that has to get rolled back.
(x86 has an actual ret opcode; ARM64 is the same: the ret opcode is like br, but with a hint that this is a function return. Some other RISCs like RISC-V don't have a separate opcode, and just assume that branch-to-register with the link register is a return.)