Find the average value of n largest in a month, but day has to be unique (Pandas)-CodePudding

How can I find the average value of n largest values in a month, but day has to be unique?

I do have a timestamp column as well, but I would guess making columns of them is the way to go?

I tried df['peak_avg'] = df.groupby(['month', 'day'])['value'].transform(lambda x: x.nlargest(3).mean()), but this takes the average of the three largest days.

month	day	value	peak_avg (expected)
1	1	35	35
1	1	30	35
2	1	34	28.5
2	2	23	28.5
3	1	98	97
3	2	96.	97

CodePudding user response：

IIUC, you can drop the duplicate in month and day columns and at last fill them

df['peak_avg'] = (df.sort_values(['month', 'day', 'value'], ascending=[True, True, False])
                  .drop_duplicates(['month', 'day'])
                  .groupby(['month'])['value']
                  .transform(lambda x: x.head(3).mean()))
df['peak_avg'] = df.groupby(['month', 'day'])['peak_avg'].apply(lambda g: g.ffill().bfill())

print(df)

   month  day  value  peak_avg
0      1    1     35      35.0
1      1    1     12      35.0
2      2    1     34      28.5
3      2    3     23      28.5
4      3    1     98      98.0
5      3    2     98      98.0

CodePudding user response：

Solution

Pivot the dataframe with aggfunc max, then sort and select the top three columns and use nanmean along columns axis to calculate average

s = df.pivot_table('value', 'month', 'day', 'max')
s['avg'] = np.nanmean(np.sort(-s, 1)[:, :3] * -1, 1)
df['avg'] = df['month'].map(s['avg'])

   month  day  value  peak_avg (expected)   avg
0      1    1   35.0                 35.0  35.0
1      1    1   30.0                 35.0  35.0
2      2    1   34.0                 28.5  28.5
3      2    2   23.0                 28.5  28.5
4      3    1   98.0                 97.0  97.0
5      3    2   96.0                 97.0  97.0

CodePudding user response：

You can first derive max value for a day and you should group only by month, since you want to take average of month.

df['max_value'] = df.groupby(['month', 'day']).value.transform(max)

df['peak_avg'] = df.groupby('month').value.transform(lambda x: x.nlargest(3).mean())