I have the following data frame:
HorseId | FGrating | Average FGrating |
---|---|---|
1736 | 110 | -1 |
1736 | 124 | -1 |
1736 | 118 | -1 |
13973 | 144 | -1 |
13973 | 137 | -1 |
I want to fill the Average FGrating column with the running average of FGrating for each horse, that is, the mean of that horse's ratings up to and including the current row, grouped by HorseId. The result that I am looking for is this:
HorseId | FGrating | Average FGrating |
---|---|---|
1736 | 110 | 110 |
1736 | 124 | 117 = (110 + 124)/2 |
1736 | 118 | 117.3 = (110 + 124 + 118)/3 |
13973 | 144 | 144 |
13973 | 137 | 140.5 = (144 + 137)/2 |
The code I used to solve this problem is:
featured_data['Average FGrating'] = featured_data[['HorseId', 'FGrating']].groupby('HorseId')[
'FGrating'].mean()
However, it fills in the average for only part of the data frame, not all of it.
What am I doing wrong? How can I solve this problem?
CodePudding user response:
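We can start by grouping by HorseId and then taking the cumsum of FGrating. To get the average, we just need to divide the cumsum by a cumcount. Note that the shift keeps the current row out of its own average, so each row gets the mean of that horse's earlier ratings; the first row of each horse has no earlier ratings and is filled with its own FGrating: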
>>> df_grouped = df.groupby('HorseId')['FGrating']
>>> # transform keeps the original row index, so the result can be assigned directly
>>> df['cum_sum'] = df_grouped.transform(lambda p: p.shift(fill_value=0).cumsum())
>>> df['cum_mean'] = df['cum_sum'] / df_grouped.cumcount()
>>> # the first row of each horse is 0/0 = NaN; fall back to its own rating
>>> df['cum_mean'] = df['cum_mean'].fillna(df['FGrating'])
>>> df
   HorseId  FGrating  Average FGrating  cum_sum  cum_mean
0     1736       110                -1        0     110.0
1     1736       124                -1      110     110.0
2     1736       118                -1      234     117.0
3    13973       144                -1        0     144.0
4    13973       137                -1      144     144.0
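The output above averages only a horse's earlier rows, whereas the table in the question includes the current row as well. If that inclusive average is what is wanted, a minimal sketch (assuming the same two columns as in the question, rebuilt here so the snippet runs on its own) divides the plain cumulative sum by cumcount() + 1:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "HorseId": [1736, 1736, 1736, 13973, 13973],
    "FGrating": [110, 124, 118, 144, 137],
})

grouped = df.groupby("HorseId")["FGrating"]

# Running total per horse, including the current row.
running_sum = grouped.cumsum()
# cumcount() starts at 0 within each group, so add 1 to get the row count so far.
running_count = grouped.cumcount() + 1

df["Average FGrating"] = running_sum / running_count
print(df)
```

Both cumsum() and cumcount() return results aligned to the original rows, so no apply or fillna step is needed in this variant.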
The same shifted running mean can also be computed more concisely with expanding() (here the first row of each horse is left as NaN):
df['cum_mean'] = (
    df.groupby('HorseId')['FGrating'].transform(lambda x: x.shift().expanding().mean()))
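As for the "what am I doing wrong" part of the question, the likely cause is index alignment: groupby(...).mean() returns one value per horse, indexed by HorseId, and assigning that Series to a column matches it against the frame's row labels. Only rows whose label happens to equal a HorseId receive a value. The sketch below reproduces this on the sample data from the question (the names attempt and fixed are illustrative), then shows a transform-based version that stays aligned to the rows:

```python
import pandas as pd

# Sample data copied from the question.
featured_data = pd.DataFrame({
    "HorseId": [1736, 1736, 1736, 13973, 13973],
    "FGrating": [110, 124, 118, 144, 137],
})

# One mean per horse; the result is indexed by HorseId, not by row.
per_horse = featured_data.groupby("HorseId")["FGrating"].mean()
print(per_horse.index.tolist())

# Assignment aligns on index labels. The row labels here are 0..4, none of
# which equals a HorseId, so every row ends up NaN.
attempt = featured_data.copy()
attempt["Average FGrating"] = per_horse
print(attempt)

# transform returns a Series with the original row index, so it can be
# assigned directly. Without a shift, the expanding mean includes the current row.
fixed = featured_data.copy()
fixed["Average FGrating"] = (
    featured_data.groupby("HorseId")["FGrating"]
    .transform(lambda s: s.expanding().mean())
)
print(fixed)
```

On a larger frame, some row labels may coincide with HorseId values, which would explain why only part of the column was filled.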