Get G++ to use a custom calling convention to pass larger structs in registers instead of memory?

Time:04-15

Short question: Are there compiler options or function attributes available in g++ that force the compiler to pass the members of a structure through registers instead of the stack?

Long question: In my application I have a list of function handles that I am basically calling in a loop. Since every function does only a small amount of work, the function call overhead needs to be minimized.

I now want to pass the arguments in a struct. This has the advantage that a change in the arguments needs to be made in only one place, not in like 20 places all over the code base. Another advantage is that some arguments depend on template parameters, which add or remove arguments. A struct would handle that cleanly.

The problem is that if the struct has more than two members, g++ pushes the struct onto the stack instead of passing the arguments in registers. This causes the performance to go down by 50%. I produced a small example that demonstrates the problem:

#include <cstddef>   // size_t
#include <cstdint>   // uint8_t
#include <iostream>

struct A { 
  uint8_t n;
  size_t& __restrict__ dataPos;
  char* const __restrict__ data;
};

struct B { 
  size_t& __restrict__ dataPos;
  char* const __restrict__ data;
};

__attribute__((noinline)) void funcStructA(A a) {
  std::cout << "out struct A: n: " << a.n << " dataPos: " << a.dataPos << " data: " << a.data << std::endl;
}

__attribute__((noinline)) void funcStructB(uint8_t n, B b) {
  std::cout << "out struct B: n: " << n << " dataPos: " << b.dataPos << " data: " << b.data << std::endl;
}

__attribute__((noinline)) void funcDirect(uint8_t n, size_t& __restrict__ dataPos, char* const __restrict__ data) {
  std::cout << "out direct: n: " << n << " dataPos: " << dataPos << " data: " << data << std::endl;
}

int main(int nargs, char** args) {

  char data[1000];

  size_t pos = 100;

  funcStructA(A{10, pos, data});
  funcStructB(10, B{pos, data});
  funcDirect(10, pos, data);

  return 0;
}

The assembly code (g++ -std=c++14 -O3, version 11.2.1 20220127 (Red Hat 11.2.1-9)) in main is:

  401119:    push   QWORD PTR [rsp+0x10]
  40111d:    push   QWORD PTR [rsp+0x10]
  401121:    push   QWORD PTR [rsp+0x38]
  401125:    call   401280 <funcStructA(A)>
  40112a:    add    rsp,0x20
  40112e:    mov    rsi,rbp
  401131:    mov    rdx,r12
  401134:    mov    edi,0xa
  401139:    call   4013a0 <funcStructB(unsigned char, B)>
  40113e:    mov    rdx,r12
  401141:    mov    rsi,rbp
  401144:    mov    edi,0xa
  401149:    call   4014c0 <funcDirect(unsigned char, unsigned long&, char*)>

In funcStructA the structure is pushed to the stack; for funcStructB the members are passed through the registers.

I tried to move n around in the struct or pass it by reference, but the behavior is always the same.

I read through the attributes available in GCC (https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes, https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes) but could not find one that matches my problem. I tried cdecl, fastcall and ms_abi, but they did not change much.

Passing the structure by reference causes the same problems.

clang seems to have the same problem. I will run a test in the next few days.

Any help would be appreciated.

CodePudding user response:

You could pass the uint8_t or one of the pointers as a separate arg to describe what you want to the compiler, or stuff it into one of the existing 64-bit members (see below).


Unfortunately no, there aren't compiler options that tweak the C ABI / calling-convention rules to pass structs larger than 16 bytes in registers on x86-64 or other ISAs. The x86-64 System V ABI doesn't do that, and there isn't another calling convention GCC knows about which does. The Windows x64 ABI only passes up to 8-byte objects in registers, not even 16.

Also, you can't override the C++ ABI rule that non-trivially-copyable objects (or whatever the exact criterion is) are passed in memory so they always have an address. (e.g. by value on the stack in x86-64 System V.)

The only options I know of that modify the calling convention are -mabi=ms or whatever to select an existing calling convention GCC knows about. Or ones that affect whether certain registers are call-preserved or call-clobbered, like -fcall-used-reg (GCC manual) and some ABI-affecting options like -fpack-struct[=n] that aren't specifically about the calling convention. (And no, -fpack-struct wouldn't help. Bringing sizeof(A) down from 24 to 17 doesn't let it fit in 2 regs.)

In theory with -fwhole-program or maybe -flto, GCC could invent custom calling conventions, but AFAIK it doesn't. It can take advantage of the fact that another function doesn't clobber certain registers, in terms of inter-procedural optimization (IPO) other than inlining, but not changing how args are passed.

The normal way to handle calling-convention overhead is to make sure small functions inline (e.g. by compiling with -flto to allow cross-file inlining), but this doesn't work if you're taking function pointers or using virtual functions.


It's not the number of members, it's the total size, so the x32 ABI (with 32-bit pointers/references and size_t) would be able to pass / return that struct packed into two registers: g++ -O3 -mx32.

(x86-64 SysV packs aggregates into up to 2 registers using the same layout they would have in memory, so smaller members mean more members fit in 16 bytes.)
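To make the size argument concrete, here is a minimal sketch that checks the sizes of the two structs from the question at compile time. The helper fitsInTwoRegisters is a hypothetical name introduced only for illustration, and it is a deliberate simplification: the real x86-64 System V classification also looks at member types and at whether the class is trivial for the purposes of calls. The size assertions assume an ABI where references and pointers are one machine word wide and naturally aligned (24 and 16 bytes on x86-64).

```cpp
#include <cstddef>
#include <cstdint>

// Same layouts as in the question (__restrict__ dropped, it does not affect size).
struct A {
  std::uint8_t n;        // 1 byte + padding up to pointer alignment
  std::size_t& dataPos;  // a reference member is stored like a pointer
  char* const data;
};

struct B {
  std::size_t& dataPos;
  char* const data;
};

constexpr std::size_t kWord = sizeof(void*);

// Hypothetical helper: true if T is no larger than two machine words.
// Simplified stand-in for the ABI rule, which has further conditions.
template <class T>
constexpr bool fitsInTwoRegisters() {
  return sizeof(T) <= 2 * kWord;
}

static_assert(sizeof(B) == 2 * kWord, "B is two words: 16 bytes on x86-64");
static_assert(sizeof(A) == 3 * kWord, "A is padded to three words: 24 bytes on x86-64");
```

With these definitions fitsInTwoRegisters<B>() is true and fitsInTwoRegisters<A>() is false, which matches the register vs. stack behaviour seen in the disassembly above.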


Or if you can settle for having a 32-bit size by value, or 48-bit size, you could pack the uint8_t into the upper byte of a uint64_t, or even use bitfield members. But since you have a level of indirection (a reference member) for size_t& __restrict__ dataPos;, that member is basically another pointer; using uint32_t& there wouldn't help since a pointer is still 64 bits. I assume you need that to be a reference for some reason.
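A rough sketch of the "size by value" variant, assuming you can live with a 56-bit position passed by value instead of a size_t& member. PackedArgs, makeArgs, argN and argPos are hypothetical names invented for this example, not anything from the question's code base, and the callee would have to hand the updated position back some other way (e.g. as a return value).

```cpp
#include <cstdint>

// Two 8-byte members, so the struct is at most 16 bytes.
struct PackedArgs {
  std::uint64_t nAndPos;  // bits 63..56: n, bits 55..0: position (by value)
  char* data;
};

static_assert(sizeof(PackedArgs) <= 16, "must fit in two 64-bit registers");

constexpr std::uint64_t kPosMask = (std::uint64_t{1} << 56) - 1;

// Pack n into the top byte; positions wider than 56 bits are truncated.
inline PackedArgs makeArgs(std::uint8_t n, std::uint64_t pos, char* data) {
  return PackedArgs{(std::uint64_t{n} << 56) | (pos & kPosMask), data};
}

inline std::uint8_t argN(PackedArgs a) {
  return static_cast<std::uint8_t>(a.nAndPos >> 56);
}

inline std::uint64_t argPos(PackedArgs a) {
  return a.nAndPos & kPosMask;
}
```

A call site would then look like func(makeArgs(10, pos, data)), with the callee reading argN(a) and argPos(a). Unpacking costs a shift and a mask, which is typically cheaper than a store and reload through the stack.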

You could pack your uint8_t into the upper byte of a pointer. Upcoming HW will have an option to optimize this, ignoring high bits instead of enforcing correct sign-extension from 48-bit or 57-bit. Otherwise you just manually do that with shifts and & on a uintptr_t; see the related Q&A "Using the extra 16 bits in 64-bit pointers".
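A minimal sketch of the manual pointer-tagging approach, under several assumptions: uintptr_t is 64 bits, valid addresses are sign-extended from bit 55 or lower (true for current 48-bit and 57-bit x86-64 addressing), and right-shifting a negative signed value is an arithmetic shift (the case with GCC and Clang, guaranteed only from C++20). The names tagPointer, untagPointer and pointerTag are made up for this example.

```cpp
#include <cstdint>

static_assert(sizeof(std::uintptr_t) == 8, "sketch assumes 64-bit pointers");

constexpr std::uintptr_t kAddrMask = (std::uintptr_t{1} << 56) - 1;

// Store n in bits 63..56 of the pointer value. The result must not be
// dereferenced until it has been passed through untagPointer().
inline std::uintptr_t tagPointer(char* p, std::uint8_t n) {
  auto bits = reinterpret_cast<std::uintptr_t>(p);
  return (bits & kAddrMask) | (std::uintptr_t{n} << 56);
}

// Recover the original pointer by sign-extending from bit 55:
// shift the tag out on the left, then arithmetic-shift back.
inline char* untagPointer(std::uintptr_t v) {
  auto extended = static_cast<std::intptr_t>(v << 8) >> 8;
  return reinterpret_cast<char*>(extended);
}

inline std::uint8_t pointerTag(std::uintptr_t v) {
  return static_cast<std::uint8_t>(v >> 56);
}
```

The struct would then hold the size_t& plus one uintptr_t, keeping it at 16 bytes, at the cost of an untag step before every access through data.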
