How can I find the average value of n largest values in a month, but day has to be unique?
I do have a timestamp column as well, but I would guess making columns of them is the way to go?
I tried df['peak_avg'] = df.groupby(['month', 'day'])['value'].transform(lambda x: x.nlargest(3).mean())
, but this takes the average of the three largest days.
month | day | value | peak_avg (expected) |
---|---|---|---|
1 | 1 | 35 | 35 |
1 | 1 | 30 | 35 |
2 | 1 | 34 | 28.5 |
2 | 2 | 23 | 28.5 |
3 | 1 | 98 | 97 |
3 | 2 | 96. | 97 |
CodePudding user response:
IIUC, you can drop the duplicate in month
and day
columns and at last fill them
df['peak_avg'] = (df.sort_values(['month', 'day', 'value'], ascending=[True, True, False])
.drop_duplicates(['month', 'day'])
.groupby(['month'])['value']
.transform(lambda x: x.head(3).mean()))
df['peak_avg'] = df.groupby(['month', 'day'])['peak_avg'].apply(lambda g: g.ffill().bfill())
print(df)
month day value peak_avg
0 1 1 35 35.0
1 1 1 12 35.0
2 2 1 34 28.5
3 2 3 23 28.5
4 3 1 98 98.0
5 3 2 98 98.0
CodePudding user response:
Solution
Pivot
the dataframe with aggfunc max
, then sort and select the top three columns and use nanmean
along columns axis to calculate average
s = df.pivot_table('value', 'month', 'day', 'max')
s['avg'] = np.nanmean(np.sort(-s, 1)[:, :3] * -1, 1)
df['avg'] = df['month'].map(s['avg'])
month day value peak_avg (expected) avg
0 1 1 35.0 35.0 35.0
1 1 1 30.0 35.0 35.0
2 2 1 34.0 28.5 28.5
3 2 2 23.0 28.5 28.5
4 3 1 98.0 97.0 97.0
5 3 2 96.0 97.0 97.0
CodePudding user response:
You can first derive max value for a day and you should group only by month
, since you want to take average of month.
df['max_value'] = df.groupby(['month', 'day']).value.transform(max)
df['peak_avg'] = df.groupby('month').value.transform(lambda x: x.nlargest(3).mean())