How do I compare values in columns in a pandas dataframe using a function?

Time:03-18

I have a pandas dataframe and would like to create a new column based on the below condition:

def confidence_level(row):
    if (row['ctry_one'] == row['ctry_two']) and (row['Market'] == 'yes'):
        return 'H'
    
    if (row['ctry_one'] == row['ctry_two']) and (row['Market'] == 'no'):
        return 'M'
    
    if (row['ctry_one'] != row['ctry_two']) and (row['Market'] == 'yes'):
        return 'M'
    
    if (row['ctry_one'] != row['ctry_two']) and (row['Market'] == 'no'):
        return 'L'

df['status'] = confidence_level(df)

This is the error I receive:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
func_test['Confidence'].value_counts()

Has anyone experienced this before? I tried applying .all() at the end of each argument like below, but this just returns 'None' for everything:

def confidence_level(row):
    if (row['ctry_one'] == row['ctry_two']).all() and (row['Market'] == 'yes').all():
        return 'H'
    
    if (row['ctry_one'] == row['ctry_two']).all() and (row['Market'] == 'no').all():
        return 'M'
    
    if (row['ctry_one'] != row['ctry_two']).all() and (row['Market'] == 'yes').all():
        return 'M'
    
    if (row['ctry_one'] != row['ctry_two']).all() and (row['Market'] == 'no').all():
        return 'L'

CodePudding user response:

You need to call your function for each row, rather than for the whole dataframe, like this:

df['status'] = df.apply(confidence_level, axis=1)
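To see the row-wise call working end to end, here is a minimal sketch. The sample rows are made up for illustration; only the column names and the function come from the question.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "ctry_one": ["US", "US", "US", "FR"],
    "ctry_two": ["US", "US", "DE", "DE"],
    "Market":   ["yes", "no", "yes", "no"],
})

def confidence_level(row):
    # With axis=1, `row` is one row of the dataframe, so each comparison
    # below produces a single boolean and `and` behaves as expected.
    if (row["ctry_one"] == row["ctry_two"]) and (row["Market"] == "yes"):
        return "H"
    if (row["ctry_one"] == row["ctry_two"]) and (row["Market"] == "no"):
        return "M"
    if (row["ctry_one"] != row["ctry_two"]) and (row["Market"] == "yes"):
        return "M"
    if (row["ctry_one"] != row["ctry_two"]) and (row["Market"] == "no"):
        return "L"
    # Falls through (returns None) if Market is neither 'yes' nor 'no'.

df["status"] = df.apply(confidence_level, axis=1)
print(df)
```

Each row gets its own call, so the function body from the question can stay unchanged.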

That said, np.select (as in Mayank's answer) or .loc, as shown here, will run faster because both work on whole columns at once instead of calling a Python function for every row:

def confidence_level(df):
    new_df = df.copy()
    new_df.loc[(df['ctry_one'] == df['ctry_two']) & (df['Market'] == 'yes'), 'status'] = 'H'
    new_df.loc[(df['ctry_one'] == df['ctry_two']) & (df['Market'] == 'no'), 'status'] = 'M'
    new_df.loc[(df['ctry_one'] != df['ctry_two']) & (df['Market'] == 'yes'), 'status'] = 'M'
    new_df.loc[(df['ctry_one'] != df['ctry_two']) & (df['Market'] == 'no'), 'status'] = 'L'
    return new_df

df = confidence_level(df)
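The .loc approach depends on boolean masks built with `&`, which is also why the original attempt failed. Here is a small sketch with made-up sample data: comparing two columns gives a Series of booleans, `&` combines them element by element, and Python's `and` tries to reduce the whole Series to one True/False, which raises the ValueError from the question.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "ctry_one": ["US", "US", "US", "FR"],
    "ctry_two": ["US", "US", "DE", "DE"],
    "Market":   ["yes", "no", "yes", "no"],
})

same_country = df["ctry_one"] == df["ctry_two"]  # boolean Series, one value per row
market_yes = df["Market"] == "yes"               # boolean Series, one value per row

# `&` combines the two masks element-wise.
high_mask = same_country & market_yes
print(high_mask.tolist())  # [True, False, False, False]

# `and` needs a single truth value for the whole Series, so it raises.
try:
    same_country and market_yes
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

This also explains the 'None' result after adding .all(): on mixed data neither `(a == b).all()` nor `(a != b).all()` is True, so no branch matches and the function returns None once for the whole dataframe.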

CodePudding user response:

Use numpy.select instead, which is more performant and readable:

import numpy as np

conditions = [
    (df['ctry_one'] == df['ctry_two']) & (df['Market'] == 'yes'),
    (df['ctry_one'] == df['ctry_two']) & (df['Market'] == 'no'),
    (df['ctry_one'] != df['ctry_two']) & (df['Market'] == 'yes'),
    (df['ctry_one'] != df['ctry_two']) & (df['Market'] == 'no'),
]

choices = ['H', 'M', 'M', 'L']

# Pass an explicit string default for rows that match none of the conditions.
df['status'] = np.select(conditions, choices, default='')
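For a version you can run as-is, here is a minimal sketch with made-up sample data. The extra 'maybe' row and the 'unknown' default label are assumptions added for illustration; they show what happens to rows that match none of the conditions.

```python
import numpy as np
import pandas as pd

# Made-up sample data; the last row deliberately matches no condition.
df = pd.DataFrame({
    "ctry_one": ["US", "US", "US", "FR", "FR"],
    "ctry_two": ["US", "US", "DE", "DE", "FR"],
    "Market":   ["yes", "no", "yes", "no", "maybe"],
})

# Build each mask once and reuse it.
same_country = df["ctry_one"] == df["ctry_two"]
market_yes = df["Market"] == "yes"
market_no = df["Market"] == "no"

conditions = [
    same_country & market_yes,
    same_country & market_no,
    ~same_country & market_yes,
    ~same_country & market_no,
]
choices = ["H", "M", "M", "L"]

# np.select picks, per row, the choice for the first condition that is True;
# rows where no condition is True get `default` ('unknown' is a hypothetical label).
df["status"] = np.select(conditions, choices, default="unknown")
print(df)
```

Keeping the default a string, like the choices, avoids mixing types in the result column.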