Home > database >  Pandas group by x, sort by y, select z, aggregating in case of multiple maximum values
Pandas group by x, sort by y, select z, aggregating in case of multiple maximum values

Time:08-03

Suppose I have a dataframe df:

df = pd.DataFrame({'group_id' : [1,1,1,2,2,3,3,3,3],
                   'amount'  : [2,4,5,1,2,3,5,5,5],
                   'x':[2,5,8,3,6,9,3,1,0]})

   group_id  amount  x
0         1       2  2
1         1       4  5
2         1       5  8
3         2       1  3
4         2       2  6
5         3       3  9
6         3       5  3
7         3       5  1
8         3       5  0

I want to group it by group_id, then pick x, corresponding to the largest amount. The part which I cannot figure out is how to deal with cases when there are multiple rows with maximum amount. For example, 3 last rows in the df above. In such cases, I would like to aggregate values of x using mean, median or mode of x. I am trying to get the solution, in which I can implement every one of these 3 aggregation methods.

I saw many questions here, which solve the problem without dealing with multiple maximum values. For example, I could do something like this:

df.sort_values('amount', ascending=False).groupby('group_id').first().x

But I do not know how to implement different aggregation approaches.

CodePudding user response:

If I understand you question right, you can use custom function with GroupBy.apply:

out = df.groupby("group_id").apply(
    lambda x: pd.Series(
        {
            "mean": (d := x.loc[x["amount"] == x["amount"].max(), "x"]).mean(),
            "median": d.median(),
            "mode": d.mode()[0],
        }
    )
)
print(out)

Prints:

              mean  median  mode
group_id                        
1         8.000000     8.0   8.0
2         6.000000     6.0   6.0
3         1.333333     1.0   0.0

Or .describe():

out = df.groupby("group_id").apply(
    lambda x: x.loc[x["amount"] == x["amount"].max(), "x"].describe()
)
print(out)

Prints:

x         count      mean       std  min  25%  50%  75%  max
group_id                                                    
1           1.0  8.000000       NaN  8.0  8.0  8.0  8.0  8.0
2           1.0  6.000000       NaN  6.0  6.0  6.0  6.0  6.0
3           3.0  1.333333  1.527525  0.0  0.5  1.0  2.0  3.0
  • Related