I need to traverse a vector
, read each element, and map to the modulo division value. Modulo division is fast for divisors of power2. So, I need to choose between a mod
and mod_power2
during the runtime. Following is a rough outline. Please assume that I am using templates to visit the vector.
Bit manipulation tricks were taken from https://graphics.stanford.edu/~seander/bithacks.html
static inline constexpr bool if_power2(int v) {
return v && !(v & (v - 1));
}
static inline constexpr int mod_power2(int val, int num_partitions) {
return val & (num_partitions - 1);
}
static inline constexpr int mod(int val, int num_partitions) {
return val % num_partitions;
}
template<typename Func>
void visit(const std::vector<int> &data, Func &&func) {
for (size_t i = 0; i < data.size(); i ) {
func(i, data[i]);
}
}
void run1(const std::vector<int> &v1, int num_partitions, std::vector<int> &v2) {
if (if_power2(num_partitions)) {
visit(v1,
[&](size_t i, int v) {
v2[i] = mod_power2(v, num_partitions);
});
} else {
visit(v1,
[&](size_t i, int v) {
v2[i] = mod(v, num_partitions);
});
}
}
void run2(const std::vector<int> &v1, int num_partitions, std::vector<int> &v2) {
const auto part = if_power2(num_partitions) ? mod_power2 : mod;
visit(v1, [&](size_t i, int v) {
v2[i] = part(v, num_partitions);
});
}
My question is, run1
vs run2
. I prefer run2
because it is easy to read and no code duplication. But when when I check both in godbolt (https://godbolt.org/z/3ov59rb5s), AFAIU, run1
is inlined better than run2
.
So, is there a better way to write a run
function without compromising on the perf?
FOLLOW UP:
I ran a quickbench on this. https://quick-bench.com/q/zO5YJtDMbnd10pk53SauOT6Bu0g
It looks like for clang, there's no difference. But for GCC (10.3) there's a 2x perf gain for run1
.
CodePudding user response:
"Issue" with cond ? mod_power2 : mod
is that it is a function pointer, which is harder to inline.
Different lambdas have no common type. Using type-erasure such as std::function
has overhead, and devirtualization is even harder to optimize.
So, only option I see is to be able to write run1
in a "nicer" way:
Factorize the creation of the lambda; you need to turn mod
/mod_power2
into functors (else we have same issue than run2
Demo):
void run3(const std::vector<int> &v1, int num_partitions, std::vector<int> &v2) {
auto make_lambda = [&](auto&& f){
return [&](std::size_t i, int v){ v2[i] = f(v, num_partitions); };
};
if (if_power2(num_partitions)) {
visit(v1, make_lambda(mod_power2));
} else {
visit(v1, make_lambda(mod));
}
}
If you cannot change mod
/mod_power2
into functors, then create a functor instead of the lambda to got their address at compile time:
template <int (*f)(int val, int num_partitions)>
struct Functor
{
std::vector<int> &v2;
int num_partitions;
void operator ()(size_t i, int v) const
{
v2[i] = f(v, num_partitions);
}
};
void run4(const std::vector<int> &v1, int num_partitions, std::vector<int> &v2) {
if (if_power2(num_partitions)) {
visit(v1, Functor<mod_power2>{v2, num_partitions});
} else {
visit(v1, Functor<mod>{v2, num_partitions});
}
}