I am trying to compare two columns and then return a third value from one of the two adjacent columns. I have read that using iterrows is not the correct way to accomplish this so I tried making writing my own function. The trouble is figuring out the correct syntax to apply it to the df.
import pandas as pd
d = {'a':[1,2,3], 'b':[4,1,6], 'c':[6,7,8], 'd':[8,9,0]}
df = pd.DataFrame(d)
print(df)
def area_name_final(ms1, ms2, an1, an2):
if ms1 >= ms2:
return an1
else:
return an2
df['e'] = df.apply(area_name_final(df.a, df.b, df.c, df.d), axis=1)
Error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Desired Output:
a b c d e
0 1 4 6 8 8
1 2 1 7 9 7
2 3 6 8 0 0
CodePudding user response:
You can try np.where
import numpy as np
df['e'] = np.where(df['a'] >= df['b'], df['c'], df['d'])
print(df)
a b c d e
0 1 4 6 8 8
1 2 1 7 9 7
2 3 6 8 0 0
To fix your code, you need to pass row not the column to apply function
def area_name_final(row):
if row['a'] >= row['b']:
return row['c']
else:
return row['d']
df['e'] = df.apply(area_name_final, axis=1)
CodePudding user response:
You can use a simple where
condition that will be much more efficient (vectorized) than your custom function:
df['e'] = df['c'].where(df['a'].ge(df['b']), df['d'])
output:
a b c d e
0 1 4 6 8 8
1 2 1 7 9 7
2 3 6 8 0 0
CodePudding user response:
Using np.where
is definitely a good option. There's another way to do it fancily without calling numpy library.
ddf = pd.MultiIndex.from_frame(df)
result = [i[2] if i[0] >= i[1] else i[3] for i in ddf]
df['e'] = result
df
Out[9]:
a b c d e
0 1 4 6 8 8
1 2 1 7 9 7
2 3 6 8 0 0
As pandas' Multiindex helps to turn all your data in dataframe into rows of tuples, you can easily then compare components in a list/tuple.
Extra
However, of course, np.where
will give you the result faster.
def solution_1(df):
df['e'] = np.where(df['a'] >= df['b'], df['c'], df['d'])
def solution_2(df):
ddf = pd.MultiIndex.from_frame(df)
df['e'] = [i[2] if i[0] >= i[1] else i[3] for i in ddf]
%timeit solution_1(df)
268 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit solution_2(df)
1.6 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you create a global pd.MultIndex dataframe, on the other hand, the solution will be faster.
def solution_3(df):
df['e'] = [i[2] if i[0] >= i[1] else i[3] for i in ddf]
%timeit solution_3(df)
60.5 µs ± 722 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)