Pandas Compare Two String Columns to Create a Third Column-CodePudding

I have a dataframe with two columns that each hold one of three classes: diamond, gold, or silver.

import pandas as pd

class_pd = pd.DataFrame({'old_class': ['gold', 'gold', 'silver'],
                         'new_class': ['diamond', 'silver', 'silver']})

I want to create a new column that shows whether each class was upgraded or downgraded.

What I have tried

I wrote the below function to set the rules

def status_desc(class_pd, old_class, new_class):
    if ((class_pd['old_class'] == 'gold') & (class_pd['new_class'] == 'diamond') or \
       (class_pd['old_class'] == 'silver') & (class_pd['new_class'] == 'diamond') or \
       (class_pd['old_class'] == 'silver') & (class_pd['new_class'] == 'gold')):
        val = 'Upgrade'
    elif ((class_pd['old_class'] == 'diamond') & (class_pd['new_class'] == 'gold') or \
       (class_pd['old_class'] == 'diamond') & (class_pd['new_class'] == 'silver') or \
       (class_pd['old_class'] == 'gold') & (class_pd['new_class'] == 'silver')):
        val = 'Downgrade'
    else:
         val = 'NA'

Then I tried to apply the function to my dataframe using the below method

class_pd['class_desc'] = class_pd.apply(lambda x: status_desc(class_pd['old_class'], class_pd['new_class']), axis=1)

Error

I get this error

TypeError: status_desc() missing 1 required positional argument: 'new_class'

Desired Output

class_pd = pd.DataFrame({'old_class': ['gold', 'gold', 'silver'],
                         'new_class': ['diamond', 'silver', 'silver'],
                         'class_desc': ['Upgrade', 'Downgrade', 'NA']})

CodePudding user response：

Another solution uses pd.Categorical, which seems more elegant and more scalable to me:

categories = ['silver', 'gold', 'diamond']
class_pd = class_pd.apply(pd.Categorical, categories=categories, ordered=True)

class_pd['class_desc'] = 'NA'

class_pd.loc[class_pd.old_class > class_pd.new_class, 'class_desc'] = 'Downgrade'
class_pd.loc[class_pd.old_class < class_pd.new_class, 'class_desc'] = 'Upgrade'

We tell Pandas the inherent order, and can then use comparison operators.
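As a self-contained sketch of the same idea, the snippet below builds an ordered categorical dtype once and converts only the two class columns with astype, so any other columns in the dataframe would be left alone. The name tier is just an illustrative choice, and using CategoricalDtype with astype is a variation on the answer's apply call rather than the exact method shown above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

class_pd = pd.DataFrame({
    'old_class': ['gold', 'gold', 'silver'],
    'new_class': ['diamond', 'silver', 'silver'],
})

# Categories listed from lowest to highest; ordered=True is what
# enables < and > between values. `tier` is a hypothetical name.
tier = CategoricalDtype(categories=['silver', 'gold', 'diamond'], ordered=True)

# Convert only the two class columns instead of the whole dataframe.
old = class_pd['old_class'].astype(tier)
new = class_pd['new_class'].astype(tier)

# Start from the "no change" label, then overwrite the rows that moved.
class_pd['class_desc'] = 'NA'
class_pd.loc[old > new, 'class_desc'] = 'Downgrade'
class_pd.loc[old < new, 'class_desc'] = 'Upgrade'

print(class_pd)
```

Adding a new class later only requires inserting it at the right position in the categories list; the comparisons stay the same.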

Another way to do the last bit (after adding categories) suggested by @jezrael with numpy.select:

import numpy as np

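conditions = [
    class_pd.old_class < class_pd.new_class,
    class_pd.old_class > class_pd.new_class,
]
labels = ["Upgrade", "Downgrade"]
# Rows matching neither condition (equal classes) fall through to the default.
# A string default also keeps the result dtype consistent with the labels.
class_pd["class_desc"] = np.select(conditions, labels, default="NA")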

CodePudding user response：

Your function status_desc takes 3 arguments: class_pd, old_class, new_class, but you are only passing 2 arguments class_pd['old_class'], class_pd['new_class']. You need to pass the first argument for class_pd as well. Also you're missing a few things:

You need to return the values, not just assign them to val. So return "Upgrade", "Downgrade" and "NA".
In your .apply call you need to pass the lambda's x. If you pass class_pd, you pass the whole dataframe. With axis=1, x holds a single row of the dataframe, so the function runs once per row and reads that row's old_class and new_class values.

However, since the function never uses the old_class and new_class parameters, a simpler approach is to take a single argument (the row) and define the function like this:

def status_desc(row):
    # row is a single dataframe row, so each comparison yields a plain boolean
    old, new = row['old_class'], row['new_class']
    if ((old == 'gold' and new == 'diamond') or
            (old == 'silver' and new == 'diamond') or
            (old == 'silver' and new == 'gold')):
        return 'Upgrade'
    elif ((old == 'diamond' and new == 'gold') or
            (old == 'diamond' and new == 'silver') or
            (old == 'gold' and new == 'silver')):
        return 'Downgrade'
    else:
        return 'NA'

Then call it using:

class_pd['class_desc'] = class_pd.apply(status_desc, axis=1)

Output using this code:

  old_class new_class class_desc
0      gold   diamond    Upgrade
1      gold    silver  Downgrade
2    silver    silver         NA
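To make the point about x being a single row concrete, here is a small sketch that records what apply hands to the function when axis=1. The helper inspect_row and the list seen are hypothetical names used only for this illustration.

```python
import pandas as pd

class_pd = pd.DataFrame({
    'old_class': ['gold', 'gold', 'silver'],
    'new_class': ['diamond', 'silver', 'silver'],
})

seen = []  # collects what each call receives (illustration only)

def inspect_row(row):
    # With axis=1, `row` is a Series indexed by the column names,
    # so row['old_class'] is a single string, not a whole column.
    seen.append((type(row).__name__, row['old_class'], row['new_class']))
    return row['old_class'] == row['new_class']

unchanged = class_pd.apply(inspect_row, axis=1)

print(seen[0])             # the first row as the function saw it
print(unchanged.tolist())  # one boolean per row
```

Because each call sees scalars, ordinary if/elif logic works inside the function, which is why passing the whole dataframe from inside the lambda does not.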

CodePudding user response：

The idea here is to define a rank list whose positions encode the order of the classes, then compare the position of the old class with the position of the new class in a conditional expression. Code:

rank = ['silver', 'gold', 'diamond']  # positions: silver=0, gold=1, diamond=2
class_pd['class_desc'] = class_pd.apply(lambda x: ('Upgrade' if (rank.index(x.old_class)) < (rank.index(x.new_class)) else 'Downgrade') if x.old_class != x.new_class else 'NA',axis=1)
class_pd

Output:

    old_class   new_class   class_desc
0   gold       diamond      Upgrade
1   gold       silver       Downgrade
2   silver     silver       NA
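The same rank idea can also be written without a row-wise lambda. The sketch below is one possible variation, not part of the answer above: it swaps the list for a dict (an assumption made so that Series.map can look up ranks) and lets numpy.select pick the label.

```python
import numpy as np
import pandas as pd

class_pd = pd.DataFrame({
    'old_class': ['gold', 'gold', 'silver'],
    'new_class': ['diamond', 'silver', 'silver'],
})

# Same ordering as the rank list, expressed as a lookup table.
rank = {'silver': 0, 'gold': 1, 'diamond': 2}

# Translate each class name to its numeric position, column by column.
old_rank = class_pd['old_class'].map(rank)
new_rank = class_pd['new_class'].map(rank)

# The first matching condition wins; rows matching none get the default.
class_pd['class_desc'] = np.select(
    [new_rank > old_rank, new_rank < old_rank],
    ['Upgrade', 'Downgrade'],
    default='NA',
)

print(class_pd)
```

A class name missing from the dict maps to NaN, so both comparisons are False and the row is labeled 'NA' instead of raising the ValueError that rank.index would.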

CodePudding user response：

Firstly, you need to pass one more argument, "class_pd", to your function. Also, comparing a whole column produces a Series rather than a single True or False, so you need to index into the column to get one value. For instance, instead of class_pd['old_class'] == 'gold' you need to write class_pd['old_class'][0] == 'gold'.
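A short sketch of why the whole-column comparison cannot be used directly in an if statement, and what scalar access looks like. It uses .iloc[0] for positional access, which is a choice made for this example; note that indexing a fixed position only ever inspects that one row, so it does not by itself produce a per-row result.

```python
import pandas as pd

class_pd = pd.DataFrame({
    'old_class': ['gold', 'gold', 'silver'],
    'new_class': ['diamond', 'silver', 'silver'],
})

# Comparing a column to a string is element-wise and yields a boolean Series.
mask = class_pd['old_class'] == 'gold'
print(mask.tolist())

# Using that Series as an `if` condition is ambiguous, so pandas raises.
try:
    if mask:
        pass
    ambiguous = False
except ValueError:
    ambiguous = True
print(ambiguous)

# Picking one element gives a plain boolean that `if` can use.
first_is_gold = class_pd['old_class'].iloc[0] == 'gold'
print(first_is_gold)
```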