How to use numpy-like vectorization properly to accelerate complex condition evaluation in pandas dataframes

Time:09-23

numpy/pandas are famous for their underlying acceleration, i.e. vectorization.

Condition evaluation is a common kind of expression that occurs in code everywhere.

However, when the condition evaluation is written intuitively with the pandas dataframe apply function, it turns out to be very slow.

An example of my apply code looks like:

def condition_eval(df):
    x = df['x']
    a = df['a']
    b = df['b']
    if x <= a:
        d = round((x - a) / 0.01) - 1
        if d < -10:
            d = -10
    elif x >= b:
        d = round((x - b) / 0.01) + 1
        if d > 10:
            d = 10
    else:
        d = 0
    return d

df['eval_result'] = df.apply(condition_eval, axis=1)
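To make the snippet concrete and runnable end to end, here is a self-contained version with made-up sample data (the question provides none); the clamps are written with max/min for brevity, but the logic is the same:

```python
import pandas as pd

def condition_eval(row):
    x, a, b = row["x"], row["a"], row["b"]
    if x <= a:
        # below the lower bound: negative step count, clamped at -10
        return max(round((x - a) / 0.01) - 1, -10)
    if x >= b:
        # above the upper bound: positive step count, clamped at 10
        return min(round((x - b) / 0.01) + 1, 10)
    return 0  # strictly inside (a, b)

# Hypothetical sample data, just for illustration.
df = pd.DataFrame({"x": [0.85, 1.5, 2.03, 3.5],
                   "a": [1.0] * 4,
                   "b": [2.0] * 4})
df["eval_result"] = df.apply(condition_eval, axis=1)
print(df["eval_result"].tolist())  # [-10, 0, 4, 10]
```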

The properties of this kind of problem could be:

  1. the result for each row can be computed using only that row's own data, and always involves multiple columns.
  2. every row uses the same computation algorithm.
  3. the algorithm may contain complex conditional branches.

What's the best practice in numpy/pandas for solving this kind of problem?


Some more thoughts.

In my opinion, one of the reasons why vectorization is effective is that the underlying CPU has vector instructions (e.g. SIMD, Intel AVX), which rely on the fact that the computational instructions behave deterministically: no matter what the input data is, the result is produced after a fixed number of CPU cycles. Parallelizing such operations is therefore easy.

However, branch execution on a CPU is much more complicated. First of all, different branches of the same condition evaluation have different execution paths and may therefore take different numbers of CPU cycles. Modern CPUs even employ tricks like branch prediction, which introduce further uncertainty.

So I wonder whether and how pandas tries to accelerate this kind of vectorized condition evaluation, and whether there is a better practice for such computational workloads.
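For context, the usual vectorized answer to the branching concern is to avoid per-element branches entirely: compute every branch for every element, then merge the results with a boolean mask. A minimal sketch of that idiom (the function and data are illustrative, not from the question):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.5, 2.0])

# Scalar code would branch per element: if x < 0 use x**2, else sqrt(x).
# Vectorized code computes BOTH branches for all elements, then selects
# with a mask -- there are no data-dependent jumps in the hot loop.
neg_branch = x ** 2
pos_branch = np.sqrt(np.clip(x, 0, None))  # clip avoids sqrt of negatives
y = np.where(x < 0, neg_branch, pos_branch)
print(y)
```

The cost is doing redundant work on the untaken branch, but each pass is a tight, branch-free loop that the CPU can run with SIMD instructions, which usually wins for cheap per-element operations.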

CodePudding user response:

IIUC, this should be equivalent. If you provide example data and expected output, I'd be happy to test it and explain further.

import pandas as pd
import numpy as np

def get_eval_result(df):
    conditions = (
        df.x.le(df.a),  # x <= a
        df.x.ge(df.b),  # x >= b
    )
    choices = (
        np.where((d := df.x.sub(df.a).div(0.01).round().sub(1)).lt(-10), -10, d),
        np.where((d := df.x.sub(df.b).div(0.01).round().add(1)).gt(10), 10, d),
    )
    return np.select(conditions, choices, 0)

df = df.assign(eval_result=get_eval_result)
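A quick self-contained sanity check against the question's row-wise logic, using made-up data (the second condition is written as `ge` so that `x == b` takes the `x >= b` branch, as in the question's elif):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, just to compare the two implementations.
df = pd.DataFrame({"x": [0.85, 1.5, 2.03, 3.5],
                   "a": [1.0] * 4,
                   "b": [2.0] * 4})

def condition_eval(row):
    # The question's original per-row logic, with max/min clamps.
    x, a, b = row["x"], row["a"], row["b"]
    if x <= a:
        return max(round((x - a) / 0.01) - 1, -10)
    if x >= b:
        return min(round((x - b) / 0.01) + 1, 10)
    return 0

def get_eval_result(df):
    conditions = (df.x.le(df.a), df.x.ge(df.b))
    choices = (
        np.where((d := df.x.sub(df.a).div(0.01).round().sub(1)).lt(-10), -10, d),
        np.where((d := df.x.sub(df.b).div(0.01).round().add(1)).gt(10), 10, d),
    )
    return np.select(conditions, choices, 0)

print(get_eval_result(df).tolist())               # [-10.0, 0.0, 4.0, 10.0]
print(df.apply(condition_eval, axis=1).tolist())  # [-10, 0, 4, 10]
```

Note that `np.select` evaluates all choices for all rows and picks the first matching condition per row, mirroring the if/elif ordering.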

CodePudding user response:

np.select is best for this:

(df
 .assign(column_to_alter=lambda x: np.select([cond1, cond2, cond3],
                                             [option1, opt2, opt3],
                                             default='somevalue'))
)
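To fill in the placeholders, here is a small hypothetical instance of the same pattern, bucketing a `score` column into letter grades (column names and thresholds are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [95, 82, 67, 41]})

# Conditions are checked in order and the first match wins,
# exactly like an if/elif chain; `default` plays the role of else.
df = df.assign(grade=lambda x: np.select(
    [x.score.ge(90), x.score.ge(80), x.score.ge(60)],
    ["A", "B", "C"],
    default="F",
))
print(df.grade.tolist())  # ['A', 'B', 'C', 'F']
```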