Considering the following datafrme:
id | cat | date | max | score | |
---|---|---|---|---|---|
1 | s1 | A | 12/06 | 9 | 5.4 |
2 | s1 | B | 12/06 | 10 | 5.4 |
3 | s2 | C | 11/04 | 13 | 4.2 |
4 | s2 | D | 11/04 | 28 | 10 |
5 | s3 | E | 08/02 | 16 | 5.4 |
5 | s3 | F | 08/02 | 6 | 5.4 |
I want to group by "id" and get the first of the columns [cat, date] but at the same time I want to get the mean of the max - score columns. I think it can be done with an aggregate:
but with aggregate I can't access all the columns of the dataframe.
id | cat | date | mean(max-score) |
---|---|---|---|
s1 | A | 12/06 | 4.1 |
s2 | C | 11/04 | 13.4 |
s3 | E | 08/02 | 5.6 |
CodePudding user response:
You first need to assign max-score
to mean(max-score)
, then this is a simple groupby
agg
:
(df.assign(**{'mean(max-score)': df['max']-df['score']})
.groupby('id', as_index=False)
.agg({'cat': 'first', 'date': 'first', 'mean(max-score)': 'mean'})
)
output:
id cat date mean(max-score)
0 s1 A 12/06 4.1
1 s2 C 11/04 13.4
2 s3 E 08/02 5.6
CodePudding user response:
From pandas 0.25 onwards, you can also use pd.namedAgg, a very useful function:
df['diff'] = df['max'] - df['score']
df.groupby('id').agg(cat=pd.NamedAgg(columns='cat',aggfunc='first'),
date=pd.NamedAgg(columns='date',aggfunc='first'),
mean(max-score)=pd.NamedAgg(columns='diff',aggfunc='mean'))