Home > Software engineering >  Pandas - Calculate mean for group over expanding window of dates
Pandas - Calculate mean for group over expanding window of dates

Time:09-26

I am trying to calculate an expanding window mean grouped by dates. As an example,

df_example = pd.DataFrame({
    'group' : ['a','a','b','b','a','a','b','a','b'],
    'date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02','2021-01-02','2021-01-03','2021-01-04'],
    'val' : [True, True, False, True, False, False, True, True,False]
})

  group        date    val
0     a  2021-01-01   True 
1     a  2021-01-01   True 
2     b  2021-01-01  False 
3     b  2021-01-01   True 
4     a  2021-01-02  False 
5     a  2021-01-02  False
6     b  2021-01-02   True
7     a  2021-01-03   True
8     b  2021-01-04  False

For this, the logic I am looking for is: for each group and date, calculate the mean over all values less than or equal to the date. I have managed to do this as follows, but looking for a cleaner and more efficient way to approach it.

df1 = df_example.groupby(['group','date']).sum().groupby('group').cumsum() 
df2 = df_example.groupby(['group','date']).count().groupby('group').cumsum() 
df_result = df1 / df2


                       val
group date                
a     2021-01-01  1.000000
      2021-01-02  0.500000
      2021-01-03  0.600000
b     2021-01-01  0.500000
      2021-01-02  0.666667
      2021-01-04  0.500000

In addition to this, I would like to expand the dataframe to have observations for each date consistently across the groups, using a look back, e.g. if there was no observation on the date use the most recent value before.

                       val
group date                
a     2021-01-01  1.000000
      2021-01-02  0.500000
      2021-01-03  0.600000
      2021-01-04  0.600000
b     2021-01-01  0.500000
      2021-01-02  0.666667
      2021-01-03  0.666667
      2021-01-04  0.500000

Preferably this would be done efficiently and in a more readable way, as I am looking to perform similar logic and calculations for a number of groups and values.

CodePudding user response:

We can use pivot_table

s = df_example.pivot_table('val', 'group', 'date', ['sum', 'count'])
s = s['sum'].cumsum(1).div(s['count'].cumsum(1)).ffill(axis=1).stack()
Explanations
  1. Reshape the dataframe using pivot_table with aggfunc sum and count
      sum                                         count                                 
date  2021-01-01 2021-01-02 2021-01-03 2021-01-04 2021-01-01 2021-01-02 2021-01-03 2021-01-04
group                                                                                        
a            2.0        0.0        1.0        NaN        2.0        2.0        1.0        NaN
b            1.0        1.0        NaN        0.0        2.0        1.0        NaN        1.0
  1. Calculate cumulative sum on sum and count columns along the columns axis, then divide sum by count to calculate expanding mean
date   2021-01-01  2021-01-02  2021-01-03  2021-01-04
group                                                
a             1.0    0.500000         0.6         NaN
b             0.5    0.666667         NaN         0.5
  1. Forward fill the mean values along the columns axis
date   2021-01-01  2021-01-02  2021-01-03  2021-01-04
group                                                
a             1.0    0.500000    0.600000         0.6
b             0.5    0.666667    0.666667         0.5
  1. Stack the dataframe to reshape into multiindex series
group  date      
a      2021-01-01    1.000000
       2021-01-02    0.500000
       2021-01-03    0.600000
       2021-01-04    0.600000
b      2021-01-01    0.500000
       2021-01-02    0.666667
       2021-01-03    0.666667
       2021-01-04    0.500000
dtype: float64
  • Related