How do you compare columns 'a' and 'b' to return 'c' or 'd'?-CodePudding

I am trying to compare two columns and then return a third value from one of the two adjacent columns. I have read that using iterrows is not the correct way to accomplish this so I tried making writing my own function. The trouble is figuring out the correct syntax to apply it to the df.

import pandas as pd
d = {'a':[1,2,3], 'b':[4,1,6], 'c':[6,7,8], 'd':[8,9,0]}
df = pd.DataFrame(d)
print(df)


def area_name_final(ms1, ms2, an1, an2):
    if ms1 >= ms2:
        return an1
    else:
        return an2

df['e'] = df.apply(area_name_final(df.a, df.b, df.c, df.d), axis=1)

Error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Desired Output:

   a  b  c  d  e
0  1  4  6  8  8
1  2  1  7  9  7
2  3  6  8  0  0

CodePudding user response：

You can try np.where

import numpy as np

df['e'] = np.where(df['a'] >= df['b'], df['c'], df['d'])

print(df)

   a  b  c  d  e
0  1  4  6  8  8
1  2  1  7  9  7
2  3  6  8  0  0

To fix your code, you need to pass row not the column to apply function

def area_name_final(row):
    if row['a'] >= row['b']:
        return row['c']
    else:
        return row['d']

df['e'] = df.apply(area_name_final, axis=1)

CodePudding user response：

You can use a simple where condition that will be much more efficient (vectorized) than your custom function:

df['e'] = df['c'].where(df['a'].ge(df['b']), df['d'])

output:

   a  b  c  d  e
0  1  4  6  8  8
1  2  1  7  9  7
2  3  6  8  0  0

CodePudding user response：

Using np.where is definitely a good option. There's another way to do it fancily without calling numpy library.

ddf = pd.MultiIndex.from_frame(df)
result = [i[2] if i[0] >= i[1] else i[3] for i in ddf]
df['e'] = result

df
Out[9]: 
   a  b  c  d  e
0  1  4  6  8  8
1  2  1  7  9  7
2  3  6  8  0  0

As pandas' Multiindex helps to turn all your data in dataframe into rows of tuples, you can easily then compare components in a list/tuple.

Extra

However, of course, np.where will give you the result faster.

def solution_1(df):
    df['e'] = np.where(df['a'] >= df['b'], df['c'], df['d'])

def solution_2(df):
    ddf = pd.MultiIndex.from_frame(df)
    df['e'] = [i[2] if i[0] >= i[1] else i[3] for i in ddf]

%timeit solution_1(df)
268 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit solution_2(df)
1.6 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you create a global pd.MultIndex dataframe, on the other hand, the solution will be faster.

def solution_3(df):
    df['e'] = [i[2] if i[0] >= i[1] else i[3] for i in ddf]

%timeit solution_3(df)
60.5 µs ± 722 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)