Changing labels for Pandas rows that have the same value

I want to change labels in a Pandas dataframe for rows that have the same value but different labels:

import pandas as pd

df = pd.DataFrame({"text": ["bannana", "tomato", "potato", "potato", "lemon", "cucamber"],
                   "label": ["fruit", "veg", "fruit", "veg", "fruit", "veg"], 
                    })
                    
print(df)


       text  label
0   bannana  fruit
1    tomato    veg
2    potato  fruit
3    potato    veg
4     lemon  fruit
5  cucamber    veg

As you can see, there are 2 rows with the same text but different labels:

2    potato  fruit
3    potato    veg

So I guess that I first need to identify whether rows like this exist, and then update the values in the label column. Note: I always want to change from fruit to veg.

Desired output:

       text  label
0   bannana  fruit
1    tomato    veg
2    potato    veg
3    potato    veg
4     lemon  fruit
5  cucamber    veg

CodePudding user response：

Rows with the same text but different labels can be extracted as follows:

df.groupby('text').filter(lambda x: x['label'].nunique() > 1)

Result:

    text    label
2   potato  fruit
3   potato  veg
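
For larger dataframes, the same check can be written without a Python lambda. The following is only a sketch, assuming the df from the question; the names n_labels and conflicts are introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["bannana", "tomato", "potato", "potato", "lemon", "cucamber"],
    "label": ["fruit", "veg", "fruit", "veg", "fruit", "veg"],
})

# Number of distinct labels per text, broadcast back to every row
n_labels = df.groupby("text")["label"].transform("nunique")

# Keep only the rows whose text carries more than one distinct label
conflicts = df[n_labels > 1]
print(conflicts)
```

This should select the same two potato rows as the filter version above.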

The labels cannot be changed without a rule. You need to define the logic that decides which label should win.

Update

There is no need to filter the dataframe before changing it. Just build two conditions and apply a mask:

cond1: same text but different labels
cond2: label is fruit

For any text that has more than one label, fruit is then replaced with veg.

cond1 = df.groupby('text')['label'].transform(lambda x: x.nunique() > 1)
cond2 = df['label'].eq('fruit')
df['label'] = df['label'].mask(cond1 & cond2, 'veg')

result:

    text        label
0   bannana     fruit
1   tomato      veg
2   potato      veg
3   potato      veg
4   lemon       fruit
5   cucamber    veg
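
If this is needed in more than one place, the two conditions could be wrapped in a small function. This is only a sketch: relabel_conflicts and its old/new parameters are hypothetical names, not part of pandas, and it assumes the column names text and label from the question.

```python
import pandas as pd


def relabel_conflicts(df, old="fruit", new="veg"):
    """Hypothetical helper: return a copy of df in which `old` is replaced
    by `new`, but only for texts that carry more than one distinct label."""
    out = df.copy()
    # cond1: the text appears with more than one distinct label
    has_conflict = out.groupby("text")["label"].transform("nunique") > 1
    # cond2: the row currently carries the label to be replaced
    is_old = out["label"].eq(old)
    out.loc[has_conflict & is_old, "label"] = new
    return out


df = pd.DataFrame({
    "text": ["bannana", "tomato", "potato", "potato", "lemon", "cucamber"],
    "label": ["fruit", "veg", "fruit", "veg", "fruit", "veg"],
})

fixed = relabel_conflicts(df)
print(fixed)
```

Working on a copy leaves the original df untouched, which makes it easy to compare before and after.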

CodePudding user response：

This can be accomplished with the following code:

df_map = df.sort_values(by="label", ascending=False).groupby("text").label.first()
df["label"] = df["text"].map(df_map.to_dict())

Let's take a look at what's going on here:

First we sort the dataframe by labels in lexicographic descending order: all rows labelled with "veg" will appear before rows labelled with "fruit".
We then group by text, collapsing the rows with the same "text" value (in this example, potato).
For each group, we take the first element: as the dataframe is sorted, if "veg" is present in the group, it will be chosen.

That gives us df_map, a Series that maps each text to its label. We can then convert it to a dictionary and apply these mappings to the text column of the original dataframe using the Series.map method.
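
To make the intermediate result visible, here is a minimal end-to-end sketch of the steps described above, assuming the df from the question. The name mapping is introduced here only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["bannana", "tomato", "potato", "potato", "lemon", "cucamber"],
    "label": ["fruit", "veg", "fruit", "veg", "fruit", "veg"],
})

# Steps 1-3: sort so "veg" rows come first, group by text, keep the first label
df_map = (
    df.sort_values(by="label", ascending=False)
      .groupby("text")["label"]
      .first()
)

# One entry per distinct text; potato resolves to "veg"
mapping = df_map.to_dict()
print(mapping)

# Apply the text -> label mapping to every row of the original dataframe
df["label"] = df["text"].map(mapping)
print(df)
```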

Note: Something handy about this approach is that it's very simple to extend if you have more labels than "fruit" and "veg" and want to define a custom label priority:

order = {"fruit":0, "veg":1, "something_that_should_supersede_veg":2}
df_map = df.sort_values(by="label", key=lambda x:x.map(order), ascending=False).groupby("text").label.first()
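
As a sketch of how that could look on actual data: the sample rows and the third label "berry" below are made up purely for illustration, and the priorities in order are an assumption.

```python
import pandas as pd

# Made-up sample with a third label, for illustration only
df = pd.DataFrame({
    "text": ["potato", "potato", "tomato", "tomato", "lemon"],
    "label": ["fruit", "veg", "veg", "berry", "fruit"],
})

# Hypothetical priorities: a higher number wins when a text has several labels
order = {"fruit": 0, "veg": 1, "berry": 2}

df_map = (
    df.sort_values(by="label", key=lambda s: s.map(order), ascending=False)
      .groupby("text")["label"]
      .first()
)

# Every row gets the highest-priority label seen for its text
df["label"] = df["text"].map(df_map)
print(df)
```

Note that the key argument of sort_values requires pandas 1.1 or newer.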