Function pointer performance; slower on a single call than multiple calls?-CodePudding

I am interested in the execution speed of a function called through a pointer. I found initially that calling a function pointer through a pointer passed in as a parameter is slower than calling a locally declared function pointer. Please see the following code; you can see I have two function calls, both of which ultimately execute a lambda though a function pointer.

#include <chrono>
#include <iostream>

using namespace std;

__attribute__((noinline)) int plus_one(int x) {
  return x   1;
}

typedef int (*FUNC)(int);

#define OUTPUT_TIME(msg) std::cout << "Execution time (ns) of " << msg << ": " << std::chrono::duration_cast<chrono::nanoseconds>(t_end - t_start).count() << std::endl;
#define START_TIMING() auto const t_start = std::chrono::high_resolution_clock::now();
#define END_TIMING(msg) auto const t_end = std::chrono::high_resolution_clock::now(); OUTPUT_TIME(msg);

  auto constexpr g_count = 1000000;
  
__attribute__((noinline)) int speed_test_no_param() {

  int r;

  auto local_lambda = [](int a) {
    return plus_one(a);
  };

  FUNC f = local_lambda;
  START_TIMING();
  for (auto i = 0; i < g_count;   i)
    r = f(100);
  END_TIMING("speed_test_no_param");
  
  return r;
}

__attribute__((noinline)) int speed_test_with_param(FUNC &f) {
  int r;
  START_TIMING();
  for (auto i = 0; i < g_count;   i)
    r = f(100);
  END_TIMING("speed_test_with_param");
  
  return r;
}

int main() {

  int ret = 0;
  auto main_lambda = [](int a) {
    return plus_one(a);
  };

  ret  = speed_test_no_param();
  FUNC fp = main_lambda;
  ret  = speed_test_with_param(fp);

  return ret;
}

Built on Ubuntu 20.04 with:

g   -ggdb -ffunction-sections -O3 -std=c  17   -DNDEBUG=1 -DRELEASE=1  -c speed_test.cpp -o speed_test.o && g   -o speed_test -Wl,-gc-sections    -Wl,--start-group speed_test.o   -Wl,--rpath='$ORIGIN'   -Wl,--end-group

The results were not surprising; for any given number of runs, we see that the version without the parameter is clearly the fastest. Here is just one run; all of the many times I have run this yield the same result:

Execution time (ns) of speed_test_no_param: 74
Execution time (ns) of speed_test_with_param: 1173849

When I dig into the assembly, I found what I believe is the reason for this. The code for speed_test_no_param() is

0x000055555555534b  call 0x555555555310 <plus_one(int)>

...whereas the code for speed_test_with_param is more complicated; a fetch of the address of the lambda, then a jump to the plus_one function:

0x000055555555544e  call QWORD PTR [rbx] 
...
0x0000555555555324  jmp 0x555555555310 <plus_one(int)>

(On compiler explorer at https://godbolt.org/z/b4hqYx7Eo. Different compiler but similar assembly; timing code commented out.)

What I didn't expect though is that when I reduce the number of calls down to 1 from 1000000 (auto constexpr g_count = 1), the results are flipped with the parameter version being the fastest:

Execution time (ns) of speed_test_no_param: 61
Execution time (ns) of speed_test_with_param: 31

I have also run this many times; the parameter version is always the fastest.

I do not understand why this is; I don't now believe a call through a parameter is slower than a local variable due to this conflicting evidence, but looking at the assembly suggests it really should be.

Can someone please explain?

UPDATE

As per the comment below, ordering matters. when I call speed_test_with_param() first, speed_test_no_param() is the fastest of the two! Yet when I call speed_test_no_param() first, speed_test_with_param() is the fastest! Any explanation to this would be greatly appreciated!

CodePudding user response：

You're benchmarking a single iteration, which subjects you to cache effects and other warmup costs. The entire reason we normally run benchmarks several times is to amortize out these kinds of effects.

Caching refers to the memory hierarchy: your actual RAM is significantly slower than your CPU (and disk even more so), so to speed things up your CPU has a cache (often, multiple caches) which stores the most recently accessed bits of memory. The first time you start your program, it will need to be loaded from disk into RAM; thereafter, it will need to be loaded from RAM into the CPU caches. Uncached memory accesses can be orders of magnitudes slower than cached memory accesses. As your program runs, various bits of code and data will be loaded from RAM and cached; hence, subsequent executions of the same bit of code will often be faster than the first execution.

Other effects can include things like lazy dynamic linking and lazy initializations, wherein certain functions will perform extra work the first time they're called (for example, resolving dynamic library loads or initializing static data). These can all contribute to the first iteration being slower than subsequent iterations.

To address these issues, always make sure to run your benchmarks multiple times - and when possible, run your entire benchmark suite a few times in one process and take the lowest (fastest) run.

CodePudding user response：

With multiple loop iterations in the C source, the fast version is only doing one in asm, because you gave the optimizer enough visibility to prove that's equivalent.

Why ordering matters with just one iteration: probably warm-up effects in the library code for std::chrono. Idiomatic way of performance evaluation?

Can you confirm that my suspicion that the call without the parameter technically should be the fastest, because with the parameter involves a memory read to find the location to call?

Much more significant is whether the compiler can constant-propagate the function pointer and see what function is being called; notice how speed_test_with_param has an actual loop that calls g_count times, but speed_test_no_param can see it's calling plus_one. Clang sees through the local lambda and the noinline to notice it has no side-effects, so it only calls it once.

It doesn't inline, but it still does inter-procedural optimization. With GCC, you could block that by using __attribute__((noipa)). GCC's noclone attribute can also stop it from making a copy of the function with constant-propagation into it, but noipa is I think stronger. noinline isn't sufficient for benchmarking stuff that becomes trivial to optimize when the compiler can see everything. But I don't think clang has anything like that.

You can make functions opaque to the optimizer by putting them in separate source files and not using -flto or other option like gcc -fwhole-program

The only reason store/reload is involved with the function pointer is because you passed it by reference for no reason, even though it's just a single pointer. If you pass it by value (https://godbolt.org/z/WEvvsvoxb) you can see call rbx in the loop.

Apparently clang couldn't hoist the load because it wasn't sure the caller's function-pointer wouldn't be modified by the call, because it was making a stand-alone version of speed_test_with_param that would work with any caller and any arg, not just the one main passes. So constprop didn't happen.

An indirect call is can mispredict more easily, and yes store/reload adds a few cycles more latency before the prediction can be checked.

So yes, in general you'd expect it to be slower when the function to be called is a function-pointer arg, not a compile-time-constant fptr initialized within the calling function where the compiler can see the definition of what it's calling even if you artificially limit it.

If it becomes a call some_name instead of call rbx, that's still faster even if it does still have to loop like you were trying to make it.

(Microbenchmarking is hard, especially when you're trying to benchmark a C concept which can optimize differently depending on context; you have to know enough about compilers, optimization, and assembly to realize what makes the difference and what you're actually measuring. There isn't a meaningful answer to some questions, like "how fast or slow is the operator?", even if you limit it to integers, because it can optimize away with constants, or vectorize, or not depending on how it's used.)