Changing labels for Pandas rows that have the same value

Time:12-14

I want to change the labels in a Pandas dataframe for rows that have the same text value but different labels:

import pandas as pd

df = pd.DataFrame({"text": ["bannana", "tomato", "potato", "potato", "lemon", "cucamber"],
                   "label": ["fruit", "veg", "fruit", "veg", "fruit", "veg"],
                   })
                    
print(df)


       text  label
0   bannana  fruit
1    tomato    veg
2    potato  fruit
3    potato    veg
4     lemon  fruit
5  cucamber    veg

As you can see, there are two rows with the same text but different labels:

2    potato  fruit
3    potato    veg

So I guess that I first need to identify whether such rows exist, and then update the values in the label column. Note that I always want to change from fruit to veg.

Desired output:

       text  label
0   bannana  fruit
1    tomato    veg
2    potato    veg
3    potato    veg
4     lemon  fruit
5  cucamber    veg

CodePudding user response:

Items with the same text but different labels can be extracted as follows:

df.groupby('text').filter(lambda x: x['label'].nunique() > 1)

result

    text    label
2   potato  fruit
3   potato  veg
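
If only the names of the conflicting items are needed, rather than the rows themselves, a minimal sketch along the same lines could count the distinct labels per text value. The variable names label_counts and conflicting are illustrative, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["bannana", "tomato", "potato", "potato", "lemon", "cucamber"],
    "label": ["fruit", "veg", "fruit", "veg", "fruit", "veg"],
})

# Number of distinct labels attached to each text value
label_counts = df.groupby("text")["label"].nunique()

# Text values that carry more than one distinct label
conflicting = label_counts[label_counts > 1].index.tolist()
print(conflicting)  # ['potato']
```

An empty list would mean there is nothing to relabel.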

The labels cannot be changed without a rule: you need to decide how conflicting labels should be resolved.


Update

You don't need to filter the dataframe to make the change; just build two conditions and apply a mask:

  1. cond1 : same text but different values
  2. cond2 : label is fruit

Then, for every item that has more than one label, fruit is replaced with veg.

cond1 = df.groupby('text')['label'].transform(lambda x: x.nunique() > 1)
cond2 = df['label'].eq('fruit')
df['label'] = df['label'].mask(cond1 & cond2, 'veg')

result:

    text        label
0   bannana     fruit
1   tomato      veg
2   potato      veg
3   potato      veg
4   lemon       fruit
5   cucamber    veg
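
As a possible variation on the same idea, the sketch below uses the built-in "nunique" aggregation instead of a lambda, which yields an integer Series that is then compared to 1, and assigns through .loc rather than mask. The names has_conflict and is_fruit are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["bannana", "tomato", "potato", "potato", "lemon", "cucamber"],
    "label": ["fruit", "veg", "fruit", "veg", "fruit", "veg"],
})

# True for every row whose text value appears with more than one label
has_conflict = df.groupby("text")["label"].transform("nunique").gt(1)

# True for every row currently labelled "fruit"
is_fruit = df["label"].eq("fruit")

# Relabel only the rows that satisfy both conditions
df.loc[has_conflict & is_fruit, "label"] = "veg"
print(df)
```

Rows with a single consistent label, such as lemon, are left untouched because has_conflict is False for them.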

CodePudding user response:

This can be accomplished with the following code:

df_map = df.sort_values(by="label", ascending=False).groupby("text").label.first()
df["label"] = df["text"].map(df_map.to_dict())

Let's take a look at what's going on here:

  • First we sort the dataframe by labels in lexicographic descending order: all rows labelled with "veg" will appear before rows labelled with "fruit".
  • We then group by text, collapsing the rows with the same "text" value (in this example, potato).
  • For each group, we take the first element: as the dataframe is sorted, if "veg" is present in the group, it will be chosen.

That gives us df_map, a Series mapping each text to its label. We can then convert it to a dictionary and apply these mappings to the "text" column of the original dataframe using the Series.map method.

Note: Something handy about this approach is that it's very simple to extend if you have more labels than "fruit" and "veg" and want to define a custom label priority:

order = {"fruit": 0, "veg": 1, "something_that_should_supersede_veg": 2}
df_map = df.sort_values(by="label", key=lambda x: x.map(order), ascending=False).groupby("text").label.first()
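
To see the custom priority end to end, here is a small self-contained sketch. The third label "tuber" and the sample rows are made up purely for illustration, and the key argument of sort_values assumes pandas 1.1 or later:

```python
import pandas as pd

# Illustrative data: "tuber" is a hypothetical label that should win over "veg"
df = pd.DataFrame({
    "text": ["potato", "potato", "potato", "lemon", "tomato", "tomato"],
    "label": ["fruit", "veg", "tuber", "fruit", "veg", "fruit"],
})

# Higher number means higher priority
order = {"fruit": 0, "veg": 1, "tuber": 2}

# Sort so the highest-priority label comes first, then keep the first label per text
df_map = (
    df.sort_values(by="label", key=lambda s: s.map(order), ascending=False)
      .groupby("text")["label"]
      .first()
)

# Apply the text -> label mapping back to every row
df["label"] = df["text"].map(df_map)
print(df)
```

Here every potato row ends up as tuber, every tomato row as veg, and lemon stays fruit.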