Consider the following:
ammarfaizi2@integral:/tmp$ vi test.c
ammarfaizi2@integral:/tmp$ cat test.c
extern void use_buffer(void *buf);
void a_func(void)
{
char buffer[4096];
use_buffer(buffer);
}
__asm__("emit_mov_rbp_to_rsp:\n\tmovq %rbp, %rsp");
ammarfaizi2@integral:/tmp$ clang -Wall -Wextra -c -O3 -fno-omit-frame-pointer test.c -o test.o
ammarfaizi2@integral:/tmp$ objdump -d test.o
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <emit_mov_rbp_to_rsp>:
0: 48 89 ec mov %rbp,%rsp
3: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
a: 00 00 00
d: 0f 1f 00 nopl (%rax)
0000000000000010 <a_func>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: 48 81 ec 00 10 00 00 sub $0x1000,%rsp
1b: 48 8d bd 00 f0 ff ff lea -0x1000(%rbp),%rdi
22: e8 00 00 00 00 call 27 <a_func+0x17>
27: 48 81 c4 00 10 00 00 add $0x1000,%rsp
2e: 5d pop %rbp
2f: c3 ret
ammarfaizi2@integral:/tmp$
At the end of a_func(), just before the return, the epilogue restores %rsp. It uses add $0x1000, %rsp, which encodes to 7 bytes (48 81 c4 00 10 00 00).

Couldn't it just use mov %rbp, %rsp, which encodes to only 3 bytes (48 89 ec)?

Why doesn't clang use the shorter mov %rbp, %rsp? Given the code-size trade-off, what is the advantage of using add $0x1000, %rsp instead of mov %rbp, %rsp?
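For a concrete size comparison, here is roughly the epilogue I expected versus what clang emits (byte counts taken from the disassembly above; this is just my own tally, not compiler output):

    mov  $0, %eax             # (placeholder, ignore)
    mov  %rbp, %rsp           # 48 89 ec                 (3 bytes)
    pop  %rbp                 # 5d                       (1 byte)
    ret                       # c3                       (1 byte)
                              # expected epilogue: 5 bytes

    add  $0x1000, %rsp        # 48 81 c4 00 10 00 00     (7 bytes)
    pop  %rbp                 # 5d                       (1 byte)
    ret                       # c3                       (1 byte)
                              # clang's epilogue: 9 bytes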
Update (extra)
Even with -Os, it still results in the same code, so I think there must be a rational reason to avoid mov %rbp, %rsp.
ammarfaizi2@integral:/tmp$ clang -Wall -Wextra -c -Os -fno-omit-frame-pointer test.c -o test.o
ammarfaizi2@integral:/tmp$ objdump -d test.o
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <emit_mov_rbp_to_rsp>:
0: 48 89 ec mov %rbp,%rsp
0000000000000003 <a_func>:
3: 55 push %rbp
4: 48 89 e5 mov %rsp,%rbp
7: 48 81 ec 00 10 00 00 sub $0x1000,%rsp
e: 48 8d bd 00 f0 ff ff lea -0x1000(%rbp),%rdi
15: e8 00 00 00 00 call 1a <a_func+0x17>
1a: 48 81 c4 00 10 00 00 add $0x1000,%rsp
21: 5d pop %rbp
22: c3 ret
ammarfaizi2@integral:/tmp$
Answer:
If it's using RBP as a frame pointer at all, then yes, mov %rbp, %rsp would be more compact, and AFAIK at least as fast on all x86 microarchitectures. Even more so when the add constant doesn't fit in an imm8.
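To put numbers on that, here is a rough encoding comparison. The imm8 line is hand-assembled by me; the other two come straight from the question's disassembly:

    add  $0x10, %rsp          # 48 83 c4 10              - imm8 form, 4 bytes
    add  $0x1000, %rsp        # 48 81 c4 00 10 00 00     - imm32 form, 7 bytes
    mov  %rbp, %rsp           # 48 89 ec                 - 3 bytes, independent of frame size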
This is probably a missed optimization which you should report on https://bugs.llvm.org/enter_bug.cgi, after checking for existing duplicates.
The only reason I can think of for doing it this way is that after calling a function that saves/restores RBP, RBP is a load result (reloaded by the callee's pop %rbp). So after mov %rbp, %rsp, future use of RSP would have to wait for that load. Possibly some corner cases end up bottlenecked on store-forwarding latency, vs. a register modification just being 1 cycle.
But that seems unlikely to be worth the extra code size in general; I expect such corner cases are rare. Although that new RSP value is needed for the following pop %rbp, so the caller's restored RBP value is then the result of a chain of two dependent loads after we return. (Fortunately, ret has branch prediction to hide that latency.)
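Here is a sketch of the dependency chain I'm describing, assuming a_func()'s callee uses the usual push %rbp / pop %rbp frame; the comments are my annotation, not compiler output:

    # in use_buffer()'s epilogue:
    pop  %rbp                 # load #1: a_func's saved RBP is reloaded from the stack
    ret

    # hypothetical mov-based epilogue in a_func():
    mov  %rbp, %rsp           # RSP now depends on load #1 (store-forwarding latency)
    pop  %rbp                 # load #2: its address depends on load #1's result
    ret                       # caller's RBP restore sits behind two dependent loads

    # the add-based epilogue clang actually emits:
    add  $0x1000, %rsp        # plain ALU op on RSP, ~1 cycle, no load feeding RSP
    pop  %rbp
    ret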
So it might be worth trying both ways in some benchmarks; e.g. comparing this vs. a tweaked version of LLVM on some standard benchmarks like SPECint.