I may be misunderstanding how cache fetches work, but I'm curious if there are any compiler optimizations for aligning small functions that are not inlined.
If the cache-line-size is 64 bytes on a given machine, would it make sense to have function pointers to functions that are smaller than 64 bytes be aligned within a single cache line to prevent multiple cache fetches to retrieve the function?
Even if the function is 100 bytes in size, it could still be aligned to 2 cache lines, with the worst case being 3 if it is unaligned. Is this a viable optimization and do compilers use anything like this in real applications such as packing small commonly used functions together?
CodePudding user response:
No, mainstream compilers like gcc and clang don't leave extra unused space to start a small function at the start of a cache line, to avoid having its end cross a boundary. Nor do they choose an order within the text section that optimizes for this without reducing overall code density for I-cache and iTLB.
AFAIK, GCC doesn't even know instruction sizes; it compiles by emitting asm text to be assembled separately. Before every function, it uses .p2align 4
(assuming the default is -falign-functions=16
like on x86-64), or clang -mllvm -align-all-functions=4
(2^4 = 16), since CPUs often fetch in chunks that size, and you want the first aligned fetch to pull in multiple useful instructions.
Inside functions, GCC by default aligns branch targets (or at least tops of loops) by padding to the next multiple of 16 if it would take 10 or fewer bytes, then unconditionally align by 8, but the condition is implemented by the assembler (which does know machine-code sizes / position):
.p2align 4,,10 # GCC does this for loop tops *inside* functions
.p2align 3
Interesting idea, though, might be worth looking at whether there are any real-world benefits to doing this.
Most frequently-called functions are already hot in some level of cache (because caches work, and being frequently called means they tend to stay hot), but this possibly could reduce the number of cache lines that need to stay hot.
Another thing to consider is that for many functions, not all the code bytes are hot. e.g. the fast path might be in the first 32 bytes, with later code bytes only being if()
or else
blocks for error conditions or other special cases. (Figuring out which path through the function is the common one is part of a compiler's job, although profile-guided optimization (PGO) can help. Or hinting with C 20 [[likely]]
/ [[unlikely]]
can achieve the same result if the hints are actually correct, letting the compiler lay out the code so the fast path(s) minimizes taken branches and maximizes cache locality. How do the likely/unlikely macros in the Linux kernel work and what is their benefit? has an example using GNU C __builtin_expect()
.
Sometimes these later parts of a function will jump back to the "main" path of the function when they're done, other times they independently end with their own ret
instruction. (This is called "tail duplication", optimizing away a jmp
by duplicating the epilogue if any.)
So if you were to blindly assume that it matters to have the whole function in the same cache line, but actually only the first 32 bytes typically execute on most calls, you could end up displacing the start of a later somewhat larger function so it starts closer to the end of a cache line, maybe without gaining anything.
So this could maybe be good with profile-guided optimization to figure out which functions were actually hot, and group them adjacent to each other (for iTLB and L1i locality), and sort them so they pack nicely. Or which functions tend to be called together, one after the other.
And conversely, to group functions that often go unused for a long time together, so those cache lines can stay cold (and even iTLB entries if there are pages of them).