I have the following pandas dataframe
import pandas as pd
foo = pd.DataFrame({'id': [1,1,2,2], 'perc':[0.1,0.2,0.3,0.4], 'category':['a','b','b','a']})
I would like to create an extra column which will be the category
that corresponds to the highest perc
by id
.
The resulting dataframe looks like:
foo = pd.DataFrame({'id': [1,1,2,2], 'perc':[0.1,0.2,0.3,0.4], 'category':['a','b','b','a'], 'category_max':['b','b','a','a']})
Any ideas how I can do that ?
CodePudding user response:
Use GroupBy.transform
with convert category
to index, so for maximal caegory by perc
per group use DataFrameGroupBy.idxmax
, last for assign back is necessary convert ouput to numpy array for prevent alignment (because is different index after transform
like in original df
):
foo['category_max'] = (foo.set_index('category')
.groupby('id')['perc']
.transform('idxmax')
.to_numpy())
print (foo)
id perc category category_max
0 1 0.1 a b
1 1 0.2 b b
2 2 0.3 b a
3 2 0.4 a a
Or is possible mapping by Series.map
by aggregate values by DataFrameGroupBy.idxmax
:
foo['category_max'] = foo['id'].map(foo.set_index('category')
.groupby('id')['perc']
.idxmax())
print (foo)
id perc category category_max
0 1 0.1 a b
1 1 0.2 b b
2 2 0.3 b a
3 2 0.4 a a
CodePudding user response:
You could use a GroupBy.transform('idxmax')
:
foo['category_max'] = (foo
.set_index('category')
.groupby('id')['perc']
.transform('idxmax').values
)
output:
id perc category category_max
0 1 0.1 a b
1 1 0.2 b b
2 2 0.3 b a
3 2 0.4 a a