How can I compute the rolling mean of a column for a set period of time, using Pandas and groupby?

Time:11-17

I have the following DataFrame:

Date        Jockey ID  Position
23-12-2018  4340       1
25-11-2018  4340       5
19-12-2018  4340       10
01-01-2019  4340       3
18-10-2017  8443       1
18-02-2018  8443       6
12-05-2018  8443       7

I want to compute the rolling mean of the final position for each Jockey ID over the last 1000 days. I am looking for something like this:

Date        Jockey ID  Position  Mean Position
23-12-2018  4340       1         1     (1/1)
25-11-2018  4340       5         3     (1+5)/2
19-12-2018  4340       10        5.33  (1+5+10)/3
01-01-2019  4340       3         4.75  (1+5+10+3)/4
18-10-2017  8443       1         1     (1/1)
18-02-2018  8443       6         3.5   (1+6)/2
12-05-2018  8443       7         4.67  (1+6+7)/3

Any ideas on how to do it?

CodePudding user response:

Use:

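import pandas as pd

# Note: in the output shown here, 12-05-2018 was parsed month-first as 2018-12-05.
# Pass dayfirst=True (or format='%d-%m-%Y') if the dates are meant to be day-first.
df['Date'] = pd.to_datetime(df['Date'])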

# in the pandas version used here, passing freq to expanding() raises no error,
# but it is ignored: the result is a plain expanding mean per group
df['new'] = (df.set_index('Date')
               .groupby('Jockey ID', sort=False)['Position']
               .expanding(freq='1000D')
               .mean()
               .to_numpy())
print (df)
        Date  Jockey ID  Position       new
0 2018-12-23       4340         1  1.000000
1 2018-11-25       4340         5  3.000000
2 2018-12-19       4340        10  5.333333
3 2019-01-01       4340         3  4.750000
4 2017-10-18       8443         1  1.000000
5 2018-02-18       8443         6  3.500000
6 2018-12-05       8443         7  4.666667
# the output is the same for any freq
df['new'] = (df.set_index('Date')
               .groupby('Jockey ID', sort=False)['Position']
               .expanding(freq='30D')
               .mean()
               .to_numpy())
print (df)
        Date  Jockey ID  Position       new
0 2018-12-23       4340         1  1.000000
1 2018-11-25       4340         5  3.000000
2 2018-12-19       4340        10  5.333333
3 2019-01-01       4340         3  4.750000
4 2017-10-18       8443         1  1.000000
5 2018-02-18       8443         6  3.500000
6 2018-12-05       8443         7  4.666667

# omitting freq entirely gives the same output again
df['new'] = (df.set_index('Date')
               .groupby('Jockey ID', sort=False)['Position']
               .expanding()
               .mean()
               .to_numpy())
print (df)
        Date  Jockey ID  Position       new
0 2018-12-23       4340         1  1.000000
1 2018-11-25       4340         5  3.000000
2 2018-12-19       4340        10  5.333333
3 2019-01-01       4340         3  4.750000
4 2017-10-18       8443         1  1.000000
5 2018-02-18       8443         6  3.500000
6 2018-12-05       8443         7  4.666667
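
The three outputs above are identical because expanding() has no window length: it always averages every row seen so far in the group, in row order. As a minimal sketch (the frame is rebuilt here from the question's values, and the names manual and expanding are only illustrative), the same numbers can be reproduced with a cumulative sum divided by a running count:

```python
import pandas as pd

# Values copied from the question; dates are not needed for this check
df = pd.DataFrame({
    "Jockey ID": [4340, 4340, 4340, 4340, 8443, 8443, 8443],
    "Position": [1, 5, 10, 3, 1, 6, 7],
})

g = df.groupby("Jockey ID", sort=False)["Position"]

# Running sum divided by the number of rows seen so far in each group
manual = g.cumsum() / (g.cumcount() + 1)

# The expanding mean used in the answer, flattened back to row order
expanding = g.expanding().mean().to_numpy()

print(pd.DataFrame({"manual": manual, "expanding": expanding}))
```

Both columns should agree row by row, which suggests that no date information is involved in this calculation at all.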

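Possible solution with Grouper and GroupBy.transform. Note that pd.Grouper(freq='1000D') assigns rows to fixed, non-overlapping 1000-day bins, so this is an expanding mean within each bin rather than a sliding 1000-day window: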

df['new'] = (df.set_index('Date')
               .groupby(['Jockey ID', pd.Grouper(freq='1000D')])['Position']
               .transform(lambda x: x.expanding().mean())
               .to_numpy())
print (df)
        Date  Jockey ID  Position       new
0 2018-12-23       4340         1  1.000000
1 2018-11-25       4340         5  3.000000
2 2018-12-19       4340        10  5.333333
3 2019-01-01       4340         3  4.750000
4 2017-10-18       8443         1  1.000000
5 2018-02-18       8443         6  3.500000
6 2018-12-05       8443         7  4.666667
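
If a true sliding window over the previous 1000 days is wanted, a time-based rolling window may be closer to the question than fixed bins. The following is a sketch under a few assumptions that are not in the original answer: the dates are parsed day-first with format="%d-%m-%Y" (so 12-05-2018 becomes 12 May 2018, unlike the output above), the rows are sorted by date within each jockey because time-based windows need increasing dates, and the result column is named "Mean Position" after the question.

```python
import pandas as pd

# Frame rebuilt from the question's values
df = pd.DataFrame({
    "Date": ["23-12-2018", "25-11-2018", "19-12-2018", "01-01-2019",
             "18-10-2017", "18-02-2018", "12-05-2018"],
    "Jockey ID": [4340, 4340, 4340, 4340, 8443, 8443, 8443],
    "Position": [1, 5, 10, 3, 1, 6, 7],
})

# Assumption: dates are day-first (DD-MM-YYYY)
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")

# Time-based windows need increasing dates inside each group
df = df.sort_values(["Jockey ID", "Date"]).reset_index(drop=True)

# For each row, average the jockey's positions from the 1000 days up to and including that date
rolled = (df.set_index("Date")
            .groupby("Jockey ID")["Position"]
            .rolling("1000D")
            .mean())

# Groups come back in key order, matching the sorted frame row for row
df["Mean Position"] = rolled.to_numpy()
print(df)
```

Because every jockey's races here fall within 1000 days of each other, the window never drops a row, and the means in date order are 5, 7.5, 5.33, 4.75 for jockey 4340 and 1, 3.5, 4.67 for jockey 8443. These differ from the table in the question only because that table accumulates in row order rather than date order. With a shorter window such as "30D", older races would start to fall out of the average.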