Suppose I have a dataframe df
:
df = pd.DataFrame({'group_id' : [1,1,1,2,2,3,3,3,3],
'amount' : [2,4,5,1,2,3,5,5,5],
'x':[2,5,8,3,6,9,3,1,0]})
group_id amount x
0 1 2 2
1 1 4 5
2 1 5 8
3 2 1 3
4 2 2 6
5 3 3 9
6 3 5 3
7 3 5 1
8 3 5 0
I want to group it by group_id
, then pick x
, corresponding to the largest amount
. The part which I cannot figure out is how to deal with cases when there are multiple rows with maximum amount
. For example, 3 last rows in the df above.
In such cases, I would like to aggregate values of x
using mean, median or mode of x. I am trying to get the solution, in which I can implement every one of these 3 aggregation methods.
I saw many questions here, which solve the problem without dealing with multiple maximum values. For example, I could do something like this:
df.sort_values('amount', ascending=False).groupby('group_id').first().x
But I do not know how to implement different aggregation approaches.
CodePudding user response:
If I understand you question right, you can use custom function with GroupBy.apply
:
out = df.groupby("group_id").apply(
lambda x: pd.Series(
{
"mean": (d := x.loc[x["amount"] == x["amount"].max(), "x"]).mean(),
"median": d.median(),
"mode": d.mode()[0],
}
)
)
print(out)
Prints:
mean median mode
group_id
1 8.000000 8.0 8.0
2 6.000000 6.0 6.0
3 1.333333 1.0 0.0
Or .describe()
:
out = df.groupby("group_id").apply(
lambda x: x.loc[x["amount"] == x["amount"].max(), "x"].describe()
)
print(out)
Prints:
x count mean std min 25% 50% 75% max
group_id
1 1.0 8.000000 NaN 8.0 8.0 8.0 8.0 8.0
2 1.0 6.000000 NaN 6.0 6.0 6.0 6.0 6.0
3 3.0 1.333333 1.527525 0.0 0.5 1.0 2.0 3.0