I have a loop like:
#pragma omp parallel for num_threads(threads) schedule(static)
for (int i = 0; i < ndata; i++) {
    result[i] = calculate(data[i]);
}
with (a simplified version of) the function calculate() being:
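double calculate(double in) {
    double out;
    if (in < LP) {
        /* linear segment below LP */
        out = c.b1 * in;
    } else if (in < SP) {
        /* first power-law segment */
        out = c.a2 + c.b2 * pow((c.c2 + c.d2 * in), c.e2);
    } else if (in < HP) {
        /* second power-law segment */
        out = c.a3 + c.b3 * pow((c.c3 + c.d3 * in), c.e3);
    } else {
        /* linear segment at and above HP */
        out = c.a4 + c.b4 * in;
    }
    return out;
}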
All calculation variables are double. It's an image processing routine, so ndata is 3 x the number of pixels, which for modern cameras can be ~1E8, and I'm trying to make the routine as responsive as possible. The calculation needed is either a simple addition / multiplication or a more expensive call to pow(), depending on the subpixel value being processed. I've already done a lot of precalculation outside the loop and I'm using OpenMP to parallelise the loop, but is there anything more I can do to optimise this? I'm guessing it won't auto-vectorise particularly well, given that n successive passes round the loop might hit a mix of pow() and simple calculations.
CodePudding user response:
Consider using arrays in your struct instead of name-numbered members. That will allow you to do something like:
for (int i = 0; i < ndata; i++) {
    size_t j = 0;
    j += (data[i] >= LP);
    j += (data[i] >= SP);
    j += (data[i] >= HP);
    result[i] = c.a[j] + c.b[j] *
        pow((c.c[j] + c.d[j] * data[i]), c.e[j]);
}
Then just populate those arrays with 0.0 and 1.0 as appropriate to make the function work: for the linear segments, for example, c = 0.0, d = 1.0 and e = 1.0 make pow() return its input unchanged, and a = 0.0 covers the first segment, which has no offset.
From there it's just a matter of optimizing a specialized pow function for inlining and vectorization. As a bonus, this should run in constant time per element as long as your pow function does, at the cost of possibly unnecessary pow calls for a good portion of the data. Whether it's worth it will depend on the data set.