Home > database >  Find the average value of n largest in a month, but day has to be unique (Pandas)
Find the average value of n largest in a month, but day has to be unique (Pandas)

Time:07-02

How can I find the average value of n largest values in a month, but day has to be unique?

I do have a timestamp column as well, but I would guess making columns of them is the way to go?

I tried df['peak_avg'] = df.groupby(['month', 'day'])['value'].transform(lambda x: x.nlargest(3).mean()), but this takes the average of the three largest days.

month day value peak_avg (expected)
1 1 35 35
1 1 30 35
2 1 34 28.5
2 2 23 28.5
3 1 98 97
3 2 96. 97

CodePudding user response:

IIUC, you can drop the duplicate in month and day columns and at last fill them

df['peak_avg'] = (df.sort_values(['month', 'day', 'value'], ascending=[True, True, False])
                  .drop_duplicates(['month', 'day'])
                  .groupby(['month'])['value']
                  .transform(lambda x: x.head(3).mean()))
df['peak_avg'] = df.groupby(['month', 'day'])['peak_avg'].apply(lambda g: g.ffill().bfill())
print(df)

   month  day  value  peak_avg
0      1    1     35      35.0
1      1    1     12      35.0
2      2    1     34      28.5
3      2    3     23      28.5
4      3    1     98      98.0
5      3    2     98      98.0

CodePudding user response:

Solution

Pivot the dataframe with aggfunc max, then sort and select the top three columns and use nanmean along columns axis to calculate average

s = df.pivot_table('value', 'month', 'day', 'max')
s['avg'] = np.nanmean(np.sort(-s, 1)[:, :3] * -1, 1)
df['avg'] = df['month'].map(s['avg'])

   month  day  value  peak_avg (expected)   avg
0      1    1   35.0                 35.0  35.0
1      1    1   30.0                 35.0  35.0
2      2    1   34.0                 28.5  28.5
3      2    2   23.0                 28.5  28.5
4      3    1   98.0                 97.0  97.0
5      3    2   96.0                 97.0  97.0

CodePudding user response:

You can first derive max value for a day and you should group only by month, since you want to take average of month.

df['max_value'] = df.groupby(['month', 'day']).value.transform(max)

df['peak_avg'] = df.groupby('month').value.transform(lambda x: x.nlargest(3).mean())
  • Related