This is the second part of this question.
Suppose I have a dataframe df
and I want to select x1
and x100
, corresponding to the largest amount
, grouped by group_id
. If there are multiple rows with the largest amount
, then I want to select medians of x1
and x100
.
df = pd.DataFrame({'group_id' : [1,1,1,2,2,3,3,3,3],
'amount' : [2,4,5,1,2,3,5,5,5],
'x1':[2,5,8,3,6,9,3,1,0],
'x100':[1,2,3,4,8,9,9,4,5]})
group_id amount x1 x100
0 1 2 2 1
1 1 4 5 2
2 1 5 8 3
3 2 1 3 4
4 2 2 6 8
5 3 3 9 9
6 3 5 3 9
7 3 5 1 4
8 3 5 0 5
So the desired output looks like this:
median_x1 median_x100
group_id
1 8.0 3.0
2 6.0 8.0
3 1.0 5.0
For only 2 columns (x1
and x100
), I can simply add 1 line to @AndrejKesely solution to the previous question, something like this:
out = df.groupby("group_id").apply(
lambda x: pd.Series(
{"median_x1": (d := x.loc[x["amount"] == x["amount"].max()])['x1'].median(),
"median_x100": d["x100"].median()}
)
)
How to do this in a neat way, which will work for 100 columns, i.e., x1
, x2
up to x100
? Ideally, I do not want to copypaste one line 100 times and manually changing name of a column in an editor...
CodePudding user response:
Maybe something like this?
df.groupby('group_id').apply(
lambda x: x[x['amount'] == x['amount'].max()
].drop(columns=['amount', 'group_id']).median())
You can also use names of columns instead of .drop():
df.groupby('group_id').apply(
lambda x: x.loc[x['amount'] == x['amount'].max(), ['x1', 'x100']].median())