Pandas Compare Two String Columns to Create a Third Column

Time:10-14

I have a dataframe with two columns that each hold one of three classes: diamond, gold, or silver.

import pandas as pd

class_pd = pd.DataFrame({'old_class': ['gold', 'gold', 'silver'],
                         'new_class': ['diamond', 'silver', 'silver']})

I want to create a new column that shows whether the class was upgraded or downgraded.

What I have tried

I wrote the function below to set the rules:

def status_desc(class_pd, old_class, new_class):
    if ((class_pd['old_class'] == 'gold') & (class_pd['new_class'] == 'diamond') or \
       (class_pd['old_class'] == 'silver') & (class_pd['new_class'] == 'diamond') or \
       (class_pd['old_class'] == 'silver') & (class_pd['new_class'] == 'gold')):
        val = 'Upgrade'
    elif ((class_pd['old_class'] == 'diamond') & (class_pd['new_class'] == 'gold') or \
       (class_pd['old_class'] == 'diamond') & (class_pd['new_class'] == 'silver') or \
       (class_pd['old_class'] == 'gold') & (class_pd['new_class'] == 'silver')):
        val = 'Downgrade'
    else:
         val = 'NA'

Then I tried to apply the function to my dataframe like this:

class_pd['class_desc'] = class_pd.apply(lambda x: status_desc(class_pd['old_class'], class_pd['new_class']), axis=1)

Error

I get this error

TypeError: status_desc() missing 1 required positional argument: new_class

Desired Output

class_pd = pd.DataFrame({'old_class': ['gold', 'gold', 'silver'],
                         'new_class': ['diamond', 'silver', 'silver'],
                         'class_desc': ['Upgrade', 'Downgrade', 'NA']})

CodePudding user response:

Another solution uses pd.Categorical, which seems more elegant to me and scales better as classes are added:

categories = ['silver', 'gold', 'diamond']
class_pd = class_pd.apply(pd.Categorical, categories=categories, ordered=True)

class_pd['class_desc'] = 'NA'

class_pd.loc[class_pd.old_class > class_pd.new_class, 'class_desc'] = 'Downgrade'
class_pd.loc[class_pd.old_class < class_pd.new_class, 'class_desc'] = 'Upgrade'

We tell pandas the inherent order of the classes and can then use comparison operators on them.
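A self-contained sketch of the same idea, assuming only the three classes from the question. It converts each column with astype and a CategoricalDtype instead of apply(pd.Categorical, ...), which is a substitution made here for illustration; the names df and tier are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Illustrative data matching the question.
df = pd.DataFrame({
    'old_class': ['gold', 'gold', 'silver'],
    'new_class': ['diamond', 'silver', 'silver'],
})

# Ordered dtype: silver < gold < diamond.
tier = CategoricalDtype(['silver', 'gold', 'diamond'], ordered=True)
old = df['old_class'].astype(tier)
new = df['new_class'].astype(tier)

# Comparisons between ordered categoricals that share the same categories
# follow the category order, not alphabetical order.
df['class_desc'] = 'NA'
df.loc[old < new, 'class_desc'] = 'Upgrade'
df.loc[old > new, 'class_desc'] = 'Downgrade'

print(df)
```

Converting into separate Series leaves the original string columns untouched. A value that is not one of the listed categories becomes missing under astype, so it is worth checking the data for unexpected class names first.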

Another way to do the last step (after converting to categories), suggested by @jezrael, uses numpy.select:

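import numpy as np

# Rows that match neither condition (equal classes) fall through to the default.
conditions = [
    class_pd.old_class < class_pd.new_class,
    class_pd.old_class > class_pd.new_class,
]
labels = ["Upgrade", "Downgrade"]
# A string default avoids mixing the str labels with np.select's integer default of 0.
class_pd["class_desc"] = np.select(conditions, labels, default="NA")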

CodePudding user response:

Your function status_desc takes three arguments (class_pd, old_class, new_class), but you are passing only two: class_pd['old_class'] and class_pd['new_class']. You would need to pass class_pd as the first argument as well. You are also missing a few things:

  • You need to return the values, not just assign them to val: return 'Upgrade', 'Downgrade' or 'NA'.
  • In your .apply call, pass the lambda's x rather than class_pd. Passing class_pd hands the function the whole dataframe, whereas x holds a single row, so the function checks old_class and new_class one row at a time.
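The two points above can be sketched as follows. This variant passes the two class values from the row x as scalars and looks them up in sets of pairs. The sets UPGRADES and DOWNGRADES and the two-argument signature are illustrative choices made here, not part of the original question.

```python
import pandas as pd

class_pd = pd.DataFrame({
    'old_class': ['gold', 'gold', 'silver'],
    'new_class': ['diamond', 'silver', 'silver'],
})

# Hypothetical lookup sets of (old, new) pairs, mirroring the rules in the question.
UPGRADES = {('gold', 'diamond'), ('silver', 'diamond'), ('silver', 'gold')}
DOWNGRADES = {(new, old) for old, new in UPGRADES}

def status_desc(old_class, new_class):
    """Return the label for one (old, new) pair of scalar strings."""
    if (old_class, new_class) in UPGRADES:
        return 'Upgrade'
    if (old_class, new_class) in DOWNGRADES:
        return 'Downgrade'
    return 'NA'

# x is one row (a Series); x['old_class'] is a single string, not a column.
class_pd['class_desc'] = class_pd.apply(
    lambda x: status_desc(x['old_class'], x['new_class']), axis=1
)
print(class_pd)
```

Because the function returns on every branch, apply receives a label for each row instead of None.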

However, a simpler approach is to give the function a single argument (the row), since you are not using old_class or new_class inside it anyway:

def status_desc(row):
    # row is a single dataframe row, so each comparison yields a plain bool
    old, new = row['old_class'], row['new_class']
    if ((old == 'gold' and new == 'diamond') or
            (old == 'silver' and new == 'diamond') or
            (old == 'silver' and new == 'gold')):
        return 'Upgrade'
    elif ((old == 'diamond' and new == 'gold') or
            (old == 'diamond' and new == 'silver') or
            (old == 'gold' and new == 'silver')):
        return 'Downgrade'
    else:
        return 'NA'

Then call it using:

class_pd['class_desc'] = class_pd.apply(lambda x: status_desc(x), axis=1)

Output using this code:

  old_class new_class class_desc
0      gold   diamond    Upgrade
1      gold    silver  Downgrade
2    silver    silver         NA

CodePudding user response:

Here the main idea is to define a rank list whose positions encode the order of the classes, then compare the positions of the old and new class with an if/else. Code:

rank = ['silver', 'gold', 'diamond']  # positions: silver = 0, gold = 1, diamond = 2
class_pd['class_desc'] = class_pd.apply(
    lambda x: 'NA' if x.old_class == x.new_class
    else ('Upgrade' if rank.index(x.old_class) < rank.index(x.new_class) else 'Downgrade'),
    axis=1,
)
class_pd

Output:

    old_class   new_class   class_desc
0   gold       diamond      Upgrade
1   gold       silver       Downgrade
2   silver     silver       NA
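The same rank comparison can also be written without a row-wise apply. The sketch below is one possible vectorized variant, assuming the rank is stored in a dict (a change from the list used above) and that every value appears in it.

```python
import numpy as np
import pandas as pd

class_pd = pd.DataFrame({
    'old_class': ['gold', 'gold', 'silver'],
    'new_class': ['diamond', 'silver', 'silver'],
})

# Hypothetical lookup table: a higher number means a higher class.
rank = {'silver': 0, 'gold': 1, 'diamond': 2}

# Difference of ranks per row: positive = moved up, negative = moved down.
diff = class_pd['new_class'].map(rank) - class_pd['old_class'].map(rank)

# np.sign collapses the difference to 1, -1 or 0, which is then mapped to a label.
class_pd['class_desc'] = np.sign(diff).map({1: 'Upgrade', -1: 'Downgrade', 0: 'NA'})
print(class_pd)
```

Mapping whole columns at once avoids calling rank.index twice per row, which may matter on larger dataframes.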

CodePudding user response:

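First, status_desc needs one more argument, class_pd, because it is declared with three parameters. Second, the if test needs single values rather than whole columns: class_pd['old_class'] == 'gold' compares every row at once, whereas class_pd['old_class'][0] == 'gold' compares only the first row. Inside apply with axis=1, the row passed to the function already provides those single values.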
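A short sketch of why the if test needs a single value rather than a whole column. The variable names mask, message and flags are illustrative.

```python
import pandas as pd

class_pd = pd.DataFrame({
    'old_class': ['gold', 'gold', 'silver'],
    'new_class': ['diamond', 'silver', 'silver'],
})

# Comparing a whole column yields a boolean Series, one value per row.
mask = class_pd['old_class'] == 'gold'
print(mask.tolist())  # [True, True, False]

# A Series with several values has no single truth value, so `if mask:` raises.
message = None
try:
    if mask:
        pass
except ValueError as err:
    message = str(err)
    print(message)

# Inside apply(..., axis=1) each row's value is already a scalar string.
flags = class_pd.apply(lambda row: row['old_class'] == 'gold', axis=1)
print(flags.tolist())  # [True, True, False]
```

Indexing with [0] does produce a scalar, but it always refers to the first row, so a row-wise apply is usually the better fit for per-row rules.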
