Speedup an if-else ladder C-CodePudding

I have a piece of code that demands execution speed over anything else. By using the high_resolution_clock() from std::chrono I found out that this switch() into if-else() ladder is taking over 70% of my execution time. Is there any way to speed this up?

I'm using gcc with -O3 optimization during compiling.

I looked into a similar question: If else ladder optimisation but I can't use a return statement as it would exit the outer loop which I can't.

switch(RPL_OPTION) {
            case 0:
                for(int k = 0; k < WINDOW_SIZE; k  ) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count  ;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(1);

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count  ;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(1);

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
            case 1:
                for(int k = 0; k < WINDOW_SIZE; k  ) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count  ;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(RPL_CONST);
                        flag_output.push_back(1);

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count  ;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(RPL_CONST);
                        flag_output.push_back(1);

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
            case 2:
                for(int k = 0; k < WINDOW_SIZE; k  ) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count  ;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(upper_th);
                        flag_output.push_back(1);

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count  ;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(lower_th);
                        flag_output.push_back(1);

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
            case 3:
                //Generating a gaussian noise distribution with 0 mean and 1 std deviation
                default_random_engine generator(time(0));
                normal_distribution<float> dist(0,1);

                for(int k = 0; k < WINDOW_SIZE; k  ) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count  ;
                        //Calling a random sample from the distribution and calculating a noise value
                        filtered_output.push_back(dist(generator)*sigma);
                        flag_output.push_back(1);
                        continue;

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count  ;
                        //Calling a random sample from the distribution and calculating a noise value
                        filtered_output.push_back(dist(generator)*sigma);
                        flag_output.push_back(1);
                        continue;

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
        }

CodePudding user response：

I am nearly 98% sure that the if-else ladder is not the problem.

The std::vectors (or whatever container you use) push_back function with tons of reallocations and data copying is for me the main candidate for optimization.

Please use the reserve function to allocate the needed memory beforehand.

Then move out all invariant stuff, like

default_random_engine generator(time(0));
normal_distribution<float> dist(0,1);

But without more example code, it is hard to judge.

A profiler will give you better results. Timer functions will not help a lot here.

CodePudding user response：

A few optimizations that come to mind:

vector.push_back() or emplace_back(), even with reserve(), are poison for performance because no compiler is able to vectorize the code. We can work with plain C pointers instead or just preallocate.
Generating the random engine and distribution in the last case may have significant cost if this code is called repeatedly. We can hoist this out of the code. Note that this will also avoid issues with the repeated initialization for which you use a low-resolution time function.
This may be unnecessary but rewriting the code a bit may allow more compiler optimizations, especially by turning things into conditional move-instructions and reducing the number of branches.

/* TODO: We have better ways of initializing generators but that is
 * unrelated to its performance
 * I'm lazy and turn this into a static variable. Better use a
 * different pattern (like up in the stack somewhere)
 * but you get the idea
 */
static default_random_engine generator(time(0));
static normal_distribution<float> dist(0,1);

std::size_t output_pos = filtered_output.size();
filtered_output.resize(output_pos   WINDOW_SIZE);
flag_output.resize(output_pos   WINDOW_SIZE);

switch(RPL_OPTION) {
case 0:
    for(int k = 0; k < WINDOW_SIZE; k  ) {
        auto ansk = ans[k];
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        flag_count  = flag;
        filtered_output[output_pos   k] = ansk;
        flag_output[output_pos   k] = flag;
    }
    break;
case 1:
    for(int k = 0; k < WINDOW_SIZE; k  ) {
        auto ansk = ans[k];
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        flag_count  = flag;
        // written carefully to help compiler turning this into a CMOV
        auto filtered = flag ? RPL_CONST : ansk;
        filtered_output[output_pos   k] = filtered;
        flag_output[output_pos   k] = flag;
    }
    break;
case 2:
    for(int k = 0; k < WINDOW_SIZE; k  ) {
        auto ansk = ans[k];
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        flag_count  = flag;
        auto filtered = ansk < lower_th ? lower_th : ansk;
        filtered = ansk >= upper_th ? upper_th : filtered;
        filtered_output[output_pos   k] = filtered;
        flag_output[output_pos   k] = flag;
    }
    break;
case 3:
    for(int k = 0; k < WINDOW_SIZE; k  ) {
        // optimized under the assumption that flag is usually 1
        auto ansk = ans[k];
        auto random = dist(generator) * sigma;
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        auto filtered = flag ? random : ansk;
        filtered_output[output_pos   k] = filtered;
        flag_output[output_pos   k] = flag;
    }
    break;
}

Analyzing compiler output

I checked the resulting code with Godbolt. Cases 0-2 do vectorize. However, a lot hinges on good alias detection. So this needs to be analyzed in the context of the full function containing this code. Particular pain points are

Potential alias between ans and filtered_output. That is hard to avoid but I think compilers should be able to create code that check against this
Potential alias between the thresholds RPL_CONST and the filtered_output. When in doubt, copy the inputs into a local variable (which the compiler can prove to be alias free). Just marking them const may not be enough
Potential alias between flag_count and flag_output, depending on the types. Again, better use a local variable for the count, then copy it to its output, if required

As for case 3, computing a random sample is expensive enough that my optimization may degrade performance if the inputs are usually within limits. That needs benchmarking. The longer I think about it, losing a few clock cycles on a mis-predict is probably much less time than computing a sample without using it.

CodePudding user response：

The first thing to notice is that you push_back on vectors. The code shows no call to reserve so this will resize the vector as it grows over and over. That's probably more expensive than anything else in the loop.

The next things concerns the "if-else-ladder":

Have you actually profiled if the ladder is a problem at all? Branches are only expensive when they mispredict. Maybe the branch predictor works juse fine on your input? Assuming this switch statement is executed many times one way to help it out would be to sort the input. Then the if-else-ladder wouldn't jump randomly every time but repeat the same way many times before switching to a new case. But that only helps if the loop runs many times or the cost of sorting will negate any improvement.

And if the switch is repeated many times you can split the input into 3 groups once, for the 3 options in the ladder, and process each group without any if-else-ladder.

Looking closer at the code I see the first 2 cases in the if-else-ladder are identical (except case 2). So you can combine the tests making this a simple "if-else":

if ((ans[k] >= upper_th) || (ans[k] < lower_th))

Now due to lazy evaluation this will produce the same code as before. But you can do better:

if ((ans[k] >= upper_th) | (ans[k] < lower_th))

Now both parts get evaluated since there is no lazy evaluation of |. Except compilers are artifical stupids and might just do lazy evaluation anyway. At this point you are fighting the optimizer. Some compilers will optimize this into 2 branches, some leave it at one.

You can use something like the following trick there:

static auto fn[] = {
    [&]() { code for first choice; };
    [&]() { code for second choice; };
};
fn[((ans[k] >= upper_th) | (ans[k] < lower_th))]();

By turning the condition of your if into computing the index of an array the compiler optimization producing 2 branches is circumvented. Hopefully. At least till the next compiler update. :)

When fighting the optimizer you have to recheck your solution every time you update the compiler.

And for case 2 the difference in the code is only what value to push_back. That can be turned into a conditional move on most architectures instead of a branch if you use

(ans[k] >= upper_th) ? upper_th : lower_th;

for the push_back.

CodePudding user response：

I have a piece of code that demands execution speed over anything else

That suggests a non-obvious approach. The code pattern looks familiar enough from a signal processing viewpoint, so WINDOW_SIZE is likely non-trivial. In that case, using AVX2 with packed comparisons makes sense.

In short, you pack a whole AVX2 register full of inputs, use two AVX2 registers to store copies of the lower and upper threshold, and issue the two comparisons. This gives you two outputs, where each value is either 0 or ~0.

Hence, your flag count be determined by or-ing the two registers. It's tempting to count the flags already, but this is considered a slow "horizontal add". Better to keep track of this in another AVX register, and do one horizontal add at the end.

The updates to filtered_output depend on the case, but for 1 and 2 you can use AVX for this as well. case 0: of course should just be a straight copy to filtered_output, outside the if statements.

CodePudding user response：

Firstly, sorting ans could be a good idea because of bench prediction. But it is up to mostly size of the ans . Secondly, if you are using c 20 you can take a look at [LIKELY] and [UNLIKELY] keywords. If you can select which statement is picking mostly or the opposite you can easily use them.