I am trying to calculate an expanding window mean grouped by dates. As an example,
df_example = pd.DataFrame({
'group' : ['a','a','b','b','a','a','b','a','b'],
'date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02','2021-01-02','2021-01-03','2021-01-04'],
'val' : [True, True, False, True, False, False, True, True,False]
})
group date val
0 a 2021-01-01 True
1 a 2021-01-01 True
2 b 2021-01-01 False
3 b 2021-01-01 True
4 a 2021-01-02 False
5 a 2021-01-02 False
6 b 2021-01-02 True
7 a 2021-01-03 True
8 b 2021-01-04 False
For this, the logic I am looking for is: for each group and date, calculate the mean over all values less than or equal to the date. I have managed to do this as follows, but looking for a cleaner and more efficient way to approach it.
df1 = df_example.groupby(['group','date']).sum().groupby('group').cumsum()
df2 = df_example.groupby(['group','date']).count().groupby('group').cumsum()
df_result = df1 / df2
val
group date
a 2021-01-01 1.000000
2021-01-02 0.500000
2021-01-03 0.600000
b 2021-01-01 0.500000
2021-01-02 0.666667
2021-01-04 0.500000
In addition to this, I would like to expand the dataframe to have observations for each date consistently across the groups, using a look back, e.g. if there was no observation on the date use the most recent value before.
val
group date
a 2021-01-01 1.000000
2021-01-02 0.500000
2021-01-03 0.600000
2021-01-04 0.600000
b 2021-01-01 0.500000
2021-01-02 0.666667
2021-01-03 0.666667
2021-01-04 0.500000
Preferably this would be done efficiently and in a more readable way, as I am looking to perform similar logic and calculations for a number of groups and values.
CodePudding user response:
We can use pivot_table
s = df_example.pivot_table('val', 'group', 'date', ['sum', 'count'])
s = s['sum'].cumsum(1).div(s['count'].cumsum(1)).ffill(axis=1).stack()
Explanations
- Reshape the dataframe using
pivot_table
with aggfuncsum
andcount
sum count
date 2021-01-01 2021-01-02 2021-01-03 2021-01-04 2021-01-01 2021-01-02 2021-01-03 2021-01-04
group
a 2.0 0.0 1.0 NaN 2.0 2.0 1.0 NaN
b 1.0 1.0 NaN 0.0 2.0 1.0 NaN 1.0
- Calculate cumulative sum on
sum
andcount
columns along the columns axis, then dividesum
bycount
to calculate expanding mean
date 2021-01-01 2021-01-02 2021-01-03 2021-01-04
group
a 1.0 0.500000 0.6 NaN
b 0.5 0.666667 NaN 0.5
- Forward fill the mean values along the columns axis
date 2021-01-01 2021-01-02 2021-01-03 2021-01-04
group
a 1.0 0.500000 0.600000 0.6
b 0.5 0.666667 0.666667 0.5
- Stack the dataframe to reshape into multiindex series
group date
a 2021-01-01 1.000000
2021-01-02 0.500000
2021-01-03 0.600000
2021-01-04 0.600000
b 2021-01-01 0.500000
2021-01-02 0.666667
2021-01-03 0.666667
2021-01-04 0.500000
dtype: float64