I want to retain the string with the largest value based on a dictionary's key and value. Any suggestion to how to do it effectively?
fruit_dict = {
"Apple": 10,
"Watermelon": 20,
"Cherry": 30
}
df = pd.DataFrame(
{
"ID": [1, 2, 3, 4, 5],
"name": [
"Apple, Watermelon",
"Cherry, Watermelon",
"Apple",
"Cherry, Apple",
"Cherry",
],
}
)
ID name
0 1 Apple, Watermelon
1 2 Cherry, Watermelon
2 3 Apple
3 4 Cherry, Apple
4 5 Cherry
Expected output:
ID name
0 1 Watermelon
1 2 Cherry
2 3 Apple
3 4 Cherry
4 5 Cherry
CodePudding user response:
One way it to use apply
with max
and fruit_dict.get
as key:
new_df = (df.assign(name=df['name'].str.split(', ')
.apply(lambda l: max(l, key=fruit_dict.get)))
)
or, if you expect some names to be missing from the dictionary:
new_df = (df.assign(name=df['name'].str.split(', ')
.apply(lambda l: max(l, key=lambda x: fruit_dict.get(x, float('-inf'))))
)
output:
ID name
0 1 Watermelon
1 2 Cherry
2 3 Apple
3 4 Cherry
4 5 Cherry
CodePudding user response:
Use:
df = (df.assign(name= df['name'].str.split(', '))
.explode('name')
.assign(new = lambda x: x['name'].map(fruit_dict))
.sort_values(['ID', 'new'], ascending=[True, False])
.drop_duplicates('ID')
)
print (df)
ID name new
0 1 Watermelon 20
1 2 Cherry 30
2 3 Apple 10
3 4 Cherry 30
4 5 Cherry 30
Or:
df['new'] = df['name'].apply(lambda x: max(x.split(', '), key=fruit_dict.get))
print (df)
ID name new
0 1 Apple, Watermelon Watermelon
1 2 Cherry, Watermelon Cherry
2 3 Apple Apple
3 4 Cherry, Apple Cherry
4 5 Cherry Cherry
EDIT: If no match is returned first value:
fruit_dict = {
"Apple": 10,
"Watermelon": 20,
"Cherry": 30
}
df = pd.DataFrame(
{
"ID": [1, 2, 3, 4, 5],
"name": [
"Apple, Watermelon",
"Cherry, Watermelon",
"Apple",
"Cherry, Apple",
"ooo, Cherry2, aaaa", <- changed data
],
}
)
print (df)
df1 = (df.assign(name= df['name'].str.split(', '))
.explode('name')
.assign(new = lambda x: x['name'].map(fruit_dict))
.sort_values(['ID', 'new'], ascending=[True, False])
.drop_duplicates('ID')
)
print (df1)
ID name new
0 1 Watermelon 20.0
1 2 Cherry 30.0
2 3 Apple 10.0
3 4 Cherry 30.0
4 5 ooo NaN
If need NaN
s if no match:
df1['name'] = df1['name'].mask(df1.pop('new').isna())
print (df1)
ID name
0 1 Watermelon
1 2 Cherry
2 3 Apple
3 4 Cherry
4 5 NaN
df['new1'] = df['name'].apply(lambda x: max(x.split(', '), key=lambda x: fruit_dict.get(x, float('-inf'))))
df['new2'] = df['name'].apply(lambda x: max(x.split(', '), key=lambda x: fruit_dict.get(x, 0)))
df['new3'] = df['name'].apply(lambda x: max(x.split(', '), key=lambda x: fruit_dict.get(x, 1000)))
print (df)
ID name new1 new2 new3
0 1 Apple, Watermelon Watermelon Watermelon Watermelon
1 2 Cherry, Watermelon Cherry Cherry Cherry
2 3 Apple Apple Apple Apple
3 4 Cherry, Apple Cherry Cherry Cherry
4 5 ooo, Cherry2, aaaa ooo ooo ooo