Why does clang's epilogue use `add $N, %rsp` instead of `mov %rbp, %rsp` to restore `%rsp`?

Consider the following:

ammarfaizi2@integral:/tmp$ vi test.c
ammarfaizi2@integral:/tmp$ cat test.c

extern void use_buffer(void *buf);

void a_func(void)
{
    char buffer[4096];
    use_buffer(buffer);
}

__asm__("emit_mov_rbp_to_rsp:\n\tmovq %rbp, %rsp");

ammarfaizi2@integral:/tmp$ clang -Wall -Wextra -c -O3 -fno-omit-frame-pointer test.c -o test.o
ammarfaizi2@integral:/tmp$ objdump -d test.o

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <emit_mov_rbp_to_rsp>:
   0: 48 89 ec              mov    %rbp,%rsp
   3: 66 2e 0f 1f 84 00 00  cs nopw 0x0(%rax,%rax,1)
   a: 00 00 00 
   d: 0f 1f 00              nopl   (%rax)

0000000000000010 <a_func>:
  10: 55                    push   %rbp
  11: 48 89 e5              mov    %rsp,%rbp
  14: 48 81 ec 00 10 00 00  sub    $0x1000,%rsp
  1b: 48 8d bd 00 f0 ff ff  lea    -0x1000(%rbp),%rdi
  22: e8 00 00 00 00        call   27 <a_func+0x17>
  27: 48 81 c4 00 10 00 00  add    $0x1000,%rsp
  2e: 5d                    pop    %rbp
  2f: c3                    ret    
ammarfaizi2@integral:/tmp$ 

At the end of a_func(), just before the return, the epilogue restores %rsp. It uses add $0x1000, %rsp, which encodes as 7 bytes: 48 81 c4 00 10 00 00.

Couldn't it just use mov %rbp, %rsp, which encodes as only 3 bytes: 48 89 ec?

Why doesn't clang use the shorter form (mov %rbp, %rsp)?

Given the code-size trade-off, what is the advantage of using add $0x1000, %rsp instead of mov %rbp, %rsp?
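
For comparison, a side-by-side sketch of the two epilogue variants, with the encodings taken from the objdump output above:

  # epilogue clang emits (9 bytes total):
  add    $0x1000, %rsp      # 48 81 c4 00 10 00 00
  pop    %rbp               # 5d
  ret                       # c3

  # compact alternative (5 bytes total):
  mov    %rbp, %rsp         # 48 89 ec
  pop    %rbp               # 5d
  ret                       # c3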

Update (extra)

Even with -Os, clang still produces the same code, so I think there must be a deliberate reason to avoid mov %rbp, %rsp.

ammarfaizi2@integral:/tmp$ clang -Wall -Wextra -c -Os -fno-omit-frame-pointer test.c -o test.o
ammarfaizi2@integral:/tmp$ objdump -d test.o

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <emit_mov_rbp_to_rsp>:
   0:   48 89 ec                mov    %rbp,%rsp

0000000000000003 <a_func>:
   3:   55                      push   %rbp
   4:   48 89 e5                mov    %rsp,%rbp
   7:   48 81 ec 00 10 00 00    sub    $0x1000,%rsp
   e:   48 8d bd 00 f0 ff ff    lea    -0x1000(%rbp),%rdi
  15:   e8 00 00 00 00          call   1a <a_func+0x17>
  1a:   48 81 c4 00 10 00 00    add    $0x1000,%rsp
  21:   5d                      pop    %rbp
  22:   c3                      ret    
ammarfaizi2@integral:/tmp$ 

CodePudding user response:

If it's using RBP as a frame pointer at all, then yes, mov %rbp, %rsp would be more compact, and AFAIK at least as fast on all x86 microarchitectures. That's even more true when the add constant doesn't fit in an imm8, as here.
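
To put concrete numbers on the imm8 point, here are the encoding sizes (the 0x18 frame size is just a hypothetical example; the other two encodings come from the disassembly above):

  add    $0x18,   %rsp      # 48 83 c4 18           4 bytes (imm8: constant fits in -128..127)
  add    $0x1000, %rsp      # 48 81 c4 00 10 00 00  7 bytes (imm32 form, as in a_func)
  mov    %rbp,    %rsp      # 48 89 ec              3 bytes (independent of frame size)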

This is probably a missed optimization which you should report on https://bugs.llvm.org/enter_bug.cgi, after checking for existing duplicates.

The only reason I can think of for doing it this way: after calling a function that saves/restores RBP, RBP is a load result (from the callee's pop %rbp). So after mov %rbp, %rsp, later uses of RSP would have to wait for that load. Possibly some corner cases end up bottlenecked on store-forwarding latency, vs. register modification being just 1 cycle.

But that seems unlikely to be worth the extra code size in general; I expect such corner cases are rare. On the other hand, the new RSP value is needed by the following pop %rbp, so with the mov the caller's restored RBP value becomes the result of a chain of two loads after we return. (Fortunately, ret has branch prediction to hide that latency.)
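
To make the dependency chains concrete, here is an annotated sketch of both epilogues right after the call returns, under the assumption that the callee saved and restored RBP (so RBP is a load result):

  call   use_buffer         # callee's final "pop %rbp" makes RBP a load result

  # Variant 1 (what clang emits): RSP update is pure register arithmetic.
  add    $0x1000, %rsp      # 1-cycle ALU op, independent of the RBP load
  pop    %rbp
  ret

  # Variant 2 (the compact form): RSP inherits the load latency behind RBP,
  # and the following pop is then a load whose address depends on a load.
  mov    %rbp, %rsp
  pop    %rbp
  ret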

So it might be worth trying both ways in some benchmarks, e.g. comparing current LLVM against a version tweaked to emit mov %rbp, %rsp, on standard benchmarks like SPECint.
