How to make a computationally expensive row-wise operation into efficient vectorized one?-CodePudding

Fellow contributors, I have written a program that is meant to be applied on a data set of more than a million observations. At some point of the program I need to do row-wise operations on a pandas data frame where considering the number of observations it could take a while to be executed. I would like to find a more efficient way of doing it in particular if there is an efficient vectorized way to do it as we would do in the same situation in R. Here is a sample data set:

df = pd.DataFrame({'A' : ['Yes', 'No', 'Yes', 'Yes', 'No'],
                   'B' : ['Blue', 'Red', 'Red', 'Blue', 'Blue']})

And my code (However in real code there are more if-else expressions to be checked):

def my_fun(row):
    if (row['A'] == 'Yes') & (row['B'] == 'Blue'):
        return 'foo'
    elif (row['A'] == 'Yes') & (row['B'] == 'Red'):
        return 'bar'
    else:
        return 'foobar'

    return row

df['C'] = df.apply(my_fun, axis = 1)

Which results in:

df
     A     B       C
0  Yes  Blue     foo
1   No   Red  foobar
2  Yes   Red     bar
3  Yes  Blue     foo
4   No  Blue  foobar

Thank you very much in advance.

CodePudding user response：

You can use np.select

df['D'] = np.select(
    [df['A'].eq('Yes') & df['B'].eq('Blue'),
     df['A'].eq('Yes') & df['B'].eq('Red')],
    ['foo', 'bar'],
    'foobar'
)

print(df)

     A     B       C       D
0  Yes  Blue     foo     foo
1   No   Red  foobar  foobar
2  Yes   Red     bar     bar
3  Yes  Blue     foo     foo
4   No  Blue  foobar  foobar

CodePudding user response：

That's exactly what the built-in pandas loc is made for.

df['C'] = 'foobar'
df.loc[(df['A'] == 'Yes') & (df['B'] == 'Red'), 'C'] = 'bar'
df.loc[(df['A'] == 'Yes') & (df['B'] == 'Blue'), 'C'] = 'foo'