Fellow contributors, I have written a program that is meant to run on a data set of more than a million observations. At one point the program performs row-wise operations on a pandas DataFrame, which could take a while given the number of observations. I would like to find a more efficient way to do this, in particular a vectorized approach like the one we would use for the same problem in R. Here is a sample data set:
import pandas as pd

df = pd.DataFrame({'A': ['Yes', 'No', 'Yes', 'Yes', 'No'],
                   'B': ['Blue', 'Red', 'Red', 'Blue', 'Blue']})
And here is my code (the real code has more if-else branches to check):
def my_fun(row):
    # Each row is a Series, so the comparisons yield plain booleans
    if row['A'] == 'Yes' and row['B'] == 'Blue':
        return 'foo'
    elif row['A'] == 'Yes' and row['B'] == 'Red':
        return 'bar'
    else:
        return 'foobar'

df['C'] = df.apply(my_fun, axis=1)
Which results in:
df
     A     B       C
0  Yes  Blue     foo
1   No   Red  foobar
2  Yes   Red     bar
3  Yes  Blue     foo
4   No  Blue  foobar
Thank you very much in advance.
CodePudding user response:
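You can use np.select, which takes a list of boolean conditions, a matching list of values, and a default for rows where no condition holds:
import numpy as np

df['D'] = np.select(
    [df['A'].eq('Yes') & df['B'].eq('Blue'),
     df['A'].eq('Yes') & df['B'].eq('Red')],
    ['foo', 'bar'],
    'foobar'
)
print(df)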
     A     B       C       D
0  Yes  Blue     foo     foo
1   No   Red  foobar  foobar
2  Yes   Red     bar     bar
3  Yes  Blue     foo     foo
4   No  Blue  foobar  foobar
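Since the question mentions more if-else branches in the real code, here is a minimal sketch of how the same pattern might scale. It relies on np.select using the first condition that matches, which mirrors an if/elif chain. The names is_yes, conditions and choices are only illustrative, and the second condition is deliberately broader than in the original to show that ordering.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Yes', 'No', 'Yes', 'Yes', 'No'],
                   'B': ['Blue', 'Red', 'Red', 'Blue', 'Blue']})

# Compute shared sub-conditions once and reuse them
is_yes = df['A'].eq('Yes')

conditions = [
    is_yes & df['B'].eq('Blue'),  # plays the role of the first `if`
    is_yes,                       # plays the role of an `elif`: only used where the line above is False
]
choices = ['foo', 'bar']

# Rows matching no condition get the default, like the final `else`
df['D'] = np.select(conditions, choices, default='foobar')
print(df)
```

Keeping the conditions and choices in two parallel lists makes it easy to append further branches, as long as they stay in the same order as the original if/elif chain.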
CodePudding user response:
That's exactly what the built-in pandas loc indexer is made for. Set the default value first, then overwrite the rows that match each condition:
df['C'] = 'foobar'
df.loc[(df['A'] == 'Yes') & (df['B'] == 'Red'), 'C'] = 'bar'
df.loc[(df['A'] == 'Yes') & (df['B'] == 'Blue'), 'C'] = 'foo'
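One thing to watch if the real conditions overlap: with successive loc assignments the last matching assignment wins, which is the reverse of an if/elif chain where the first match wins. A small sketch of that ordering, using the sample data from the question (the broad `A == 'Yes'` rule is an assumption made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['Yes', 'No', 'Yes', 'Yes', 'No'],
                   'B': ['Blue', 'Red', 'Red', 'Blue', 'Blue']})

# Start with the fallback value (the `else` branch)
df['C'] = 'foobar'

# Broad rule first: every 'Yes' row becomes 'bar'
df.loc[df['A'] == 'Yes', 'C'] = 'bar'

# Specific rule last: it overwrites the broad rule for 'Yes' and 'Blue' rows
df.loc[(df['A'] == 'Yes') & (df['B'] == 'Blue'), 'C'] = 'foo'

print(df)
```

So when translating an if/elif chain into loc assignments, it is probably safest to write the branches in reverse order, or to make the conditions mutually exclusive as in the answer above.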