I want to change labels in a pandas dataframe for rows that have the same text but different labels:
import pandas as pd
df = pd.DataFrame({"text": ["bannana", "tomato", "potato", "potato", "lemon", "cucamber"],
"label": ["fruit", "veg", "fruit", "veg", "fruit", "veg"],
})
print(df)
text label
0 bannana fruit
1 tomato veg
2 potato fruit
3 potato veg
4 lemon fruit
5 cucamber veg
As you can see, there are two rows with the same text but different labels:
2 potato fruit
3 potato veg
So I guess that I first need to identify whether such rows exist, and then update the values in the label column. Note that I always want to change from fruit to veg.
Desired output:
text label
0 bannana fruit
1 tomato veg
2 potato veg
3 potato veg
4 lemon fruit
5 cucamber veg
CodePudding user response:
Items with the same text but different values can be extracted as follows:
df.groupby('text').filter(lambda x: x['label'].nunique() > 1)
Result:
text label
2 potato fruit
3 potato veg
The labels cannot be changed without a rule, so you need to define the logic for how they should change.
Update:
You don't need to filter the dataframe to make the change.
Just build two conditions and combine them into a mask:
- cond1: same text but different labels
- cond2: label is fruit
Then, for every item that has more than one label, fruit is replaced with veg.
cond1 = df.groupby('text')['label'].transform(lambda x: x.nunique() > 1)
cond2 = df['label'].eq('fruit')
df['label'] = df['label'].mask(cond1 & cond2, 'veg')
result:
text label
0 bannana fruit
1 tomato veg
2 potato veg
3 potato veg
4 lemon fruit
5 cucamber veg
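The two conditions above can also be put together as a self-contained script. This is only a sketch, assuming the sample dataframe from the question; it swaps the lambda for the built-in "nunique" aggregation and writes through df.loc instead of mask, which should give the same result here. The names conflicting and is_fruit are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["bannana", "tomato", "potato", "potato", "lemon", "cucamber"],
    "label": ["fruit", "veg", "fruit", "veg", "fruit", "veg"],
})

# True for every row whose text appears with more than one distinct label
conflicting = df.groupby("text")["label"].transform("nunique").gt(1)
# True for rows currently labelled "fruit"
is_fruit = df["label"].eq("fruit")

# Only rows where both conditions hold are overwritten
df.loc[conflicting & is_fruit, "label"] = "veg"

print(df)
```

Rows whose text has a single label are left untouched, so a "fruit" such as lemon stays a fruit.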
CodePudding user response:
This can be accomplished with the following code:
df_map = df.sort_values(by="label", ascending=False).groupby("text").label.first()
df["label"] = df["text"].map(df_map.to_dict())
Let's take a look at what's going on here:
- First we sort the dataframe by label in descending lexicographic order: all rows labelled "veg" will appear before rows labelled "fruit".
- We then group by text, collapsing the rows with the same "text" value (in this example, potato).
- For each group, we take the first element: as the dataframe is sorted, if "veg" is present in the group, it will be chosen.
That gives us df_map, a Series mapping each text to its label. We can then convert it to a dictionary and apply these mappings to the original dataframe using the Series.map method.
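A minimal end-to-end sketch of the steps described above, assuming the sample dataframe from the question. It stores the grouped result in df_map and then maps it back onto the text column.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["bannana", "tomato", "potato", "potato", "lemon", "cucamber"],
    "label": ["fruit", "veg", "fruit", "veg", "fruit", "veg"],
})

# Sort so that "veg" comes before "fruit", then keep the first label per text
df_map = (
    df.sort_values(by="label", ascending=False)
      .groupby("text")["label"]
      .first()
)

# Look up each row's text in the mapping to get its final label
df["label"] = df["text"].map(df_map.to_dict())

print(df)
```

Because the mapping is keyed by text, every row sharing a text ends up with the same label.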
Note:
Something handy about this approach is that it's very simple to extend if you have more labels than "fruit" and "veg" and want to define a custom label priority:
order = {"fruit":0, "veg":1, "something_that_should_supersede_veg":2}
df_map = df.sort_values(by="label", key=lambda x:x.map(order), ascending=False).groupby("text").label.first()
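As a rough illustration of the custom priority, here is a sketch on made-up data. The label "root" and the rows below are hypothetical and not part of the question, and the key argument of sort_values requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical data: "root" is an invented label that should win over "veg"
df = pd.DataFrame({
    "text": ["potato", "potato", "potato", "lemon"],
    "label": ["fruit", "root", "veg", "fruit"],
})

# Higher number means higher priority
order = {"fruit": 0, "veg": 1, "root": 2}

# Sort by priority (highest first), then keep the first label per text
df_map = (
    df.sort_values(by="label", key=lambda x: x.map(order), ascending=False)
      .groupby("text")["label"]
      .first()
)

df["label"] = df["text"].map(df_map.to_dict())

print(df)
```

Any label missing from order would map to NaN in the sort key, so in this sketch the dictionary is assumed to cover every label that occurs.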