choose a constexpr based on a runtime value and use it inside a hot loop-CodePudding

I need to traverse a vector, read each element, and map to the modulo division value. Modulo division is fast for divisors of power2. So, I need to choose between a mod and mod_power2 during the runtime. Following is a rough outline. Please assume that I am using templates to visit the vector.

Bit manipulation tricks were taken from https://graphics.stanford.edu/~seander/bithacks.html

static inline constexpr bool if_power2(int v) {
  return v && !(v & (v - 1));
}

static inline constexpr int mod_power2(int val, int num_partitions) {
  return val & (num_partitions - 1);
}

static inline constexpr int mod(int val, int num_partitions) {
  return val % num_partitions;
}

template<typename Func>
void visit(const std::vector<int> &data, Func &&func) {
  for (size_t i = 0; i < data.size(); i  ) {
    func(i, data[i]);
  }
}

void run1(const std::vector<int> &v1, int num_partitions, std::vector<int> &v2) {
  if (if_power2(num_partitions)) {
    visit(v1,
          [&](size_t i, int v) {
            v2[i] = mod_power2(v, num_partitions);
          });
  } else {
    visit(v1,
          [&](size_t i, int v) {
            v2[i] = mod(v, num_partitions);
          });
  }
}

void run2(const std::vector<int> &v1, int num_partitions, std::vector<int> &v2) {
  const auto part = if_power2(num_partitions) ? mod_power2 : mod;

  visit(v1, [&](size_t i, int v) {
    v2[i] = part(v, num_partitions);
  });
}

My question is, run1 vs run2. I prefer run2 because it is easy to read and no code duplication. But when when I check both in godbolt (https://godbolt.org/z/3ov59rb5s), AFAIU, run1 is inlined better than run2.

So, is there a better way to write a run function without compromising on the perf?

FOLLOW UP:

I ran a quickbench on this. https://quick-bench.com/q/zO5YJtDMbnd10pk53SauOT6Bu0g

It looks like for clang, there's no difference. But for GCC (10.3) there's a 2x perf gain for run1.

CodePudding user response：

"Issue" with cond ? mod_power2 : mod is that it is a function pointer, which is harder to inline.

Different lambdas have no common type. Using type-erasure such as std::function has overhead, and devirtualization is even harder to optimize.

So, only option I see is to be able to write run1 in a "nicer" way:

Factorize the creation of the lambda; you need to turn mod/mod_power2 into functors (else we have same issue than run2 Demo):

void run3(const std::vector<int> &v1, int num_partitions, std::vector<int> &v2) {
  auto make_lambda = [&](auto&& f){
    return [&](std::size_t i, int v){ v2[i] = f(v, num_partitions); };
  };
  if (if_power2(num_partitions)) {
    visit(v1, make_lambda(mod_power2));
  } else {
    visit(v1, make_lambda(mod));
  }
}

Demo

If you cannot change mod/mod_power2 into functors, then create a functor instead of the lambda to got their address at compile time:

template <int (*f)(int val, int num_partitions)>
struct Functor
{
  std::vector<int> &v2;
  int num_partitions;

  void operator ()(size_t i, int v) const
  {
    v2[i] = f(v, num_partitions);
  }

};

void run4(const std::vector<int> &v1, int num_partitions, std::vector<int> &v2) {
  if (if_power2(num_partitions)) {
    visit(v1, Functor<mod_power2>{v2, num_partitions});
  } else {
    visit(v1, Functor<mod>{v2, num_partitions});
  }
}

Demo