Home > Net >  Pandas group by x, sort by y, select z, aggregating in case of multiple maximum values, multiple col
Pandas group by x, sort by y, select z, aggregating in case of multiple maximum values, multiple col

Time:08-04

This is the second part of this question.

Suppose I have a dataframe df and I want to select x1 and x100, corresponding to the largest amount, grouped by group_id. If there are multiple rows with the largest amount, then I want to select medians of x1 and x100.

df = pd.DataFrame({'group_id' : [1,1,1,2,2,3,3,3,3],
                   'amount'  : [2,4,5,1,2,3,5,5,5],
                   'x1':[2,5,8,3,6,9,3,1,0],
                  'x100':[1,2,3,4,8,9,9,4,5]})

   group_id  amount  x1  x100
0         1       2   2     1
1         1       4   5     2
2         1       5   8     3
3         2       1   3     4
4         2       2   6     8
5         3       3   9     9
6         3       5   3     9
7         3       5   1     4
8         3       5   0     5

So the desired output looks like this:

          median_x1  median_x100
group_id                        
1               8.0          3.0
2               6.0          8.0
3               1.0          5.0

For only 2 columns (x1 and x100), I can simply add 1 line to @AndrejKesely solution to the previous question, something like this:

out = df.groupby("group_id").apply(
    lambda x: pd.Series(
        {"median_x1": (d := x.loc[x["amount"] == x["amount"].max()])['x1'].median(),
         "median_x100": d["x100"].median()}
    )
)

How to do this in a neat way, which will work for 100 columns, i.e., x1, x2 up to x100? Ideally, I do not want to copypaste one line 100 times and manually changing name of a column in an editor...

CodePudding user response:

Maybe something like this?

df.groupby('group_id').apply(
    lambda x: x[x['amount'] == x['amount'].max()
               ].drop(columns=['amount', 'group_id']).median())

You can also use names of columns instead of .drop():

df.groupby('group_id').apply(
    lambda x: x.loc[x['amount'] == x['amount'].max(), ['x1', 'x100']].median())
  • Related