Given some data like this:
import pandas as pd
import seaborn as sns
from random import choices
iris = sns.load_dataset('iris')[['petal_width', 'species']]
iris['color'] = choices(['red', 'blue'], k=len(iris))
print(iris.sample(n=15).reset_index(drop=True))
petal_width species color
0 0.2 setosa blue
1 2.2 virginica blue
2 0.6 setosa red
3 2.1 virginica blue
4 1.5 versicolor blue
5 1.0 versicolor red
6 0.2 setosa red
7 1.5 versicolor blue
8 2.0 virginica blue
9 1.8 virginica blue
10 0.2 setosa red
11 1.2 versicolor blue
12 0.2 setosa blue
13 1.2 versicolor red
14 0.3 setosa blue
I need to find the maximum petal width for each species only considering blue observations, and assign that maximum to a new column max_width
for each species. So since there are three unique species, there will be three unique values in the max_width
column (unless two or more of them just happen to be the same).
It would be easy to find the maximum for every species/color combination:
iris['max_width'] = iris.groupby(['species', 'color']).petal_width.transform('max')
And of course it would be easy to aggregate by species regardless of color. Is there a simple way of ignoring red cases but still assigning them the relevant maximum value of the blue observations for each species?
CodePudding user response:
IIUC, You can try using map
combined with groupby()
:
iris['max_width'] = iris['species'].map(iris[iris['color'] == 'blue'].groupby(['species']).petal_width.max().to_dict())
Sampling 15 rows, returns:
petal_width species color max_width
86 1.5 versicolor red 1.6
103 1.8 virginica blue 2.5
122 2.0 virginica red 2.5
102 2.1 virginica blue 2.5
43 0.6 setosa red 0.5
35 0.2 setosa blue 0.5
11 0.2 setosa red 0.5
20 0.2 setosa blue 0.5
145 2.3 virginica blue 2.5
85 1.6 versicolor red 1.6
7 0.2 setosa red 0.5
18 0.3 setosa blue 0.5
147 2.0 virginica blue 2.5
117 2.2 virginica blue 2.5
63 1.4 versicolor red 1.6