pandas rolling mean with conditions-CodePudding

I have a data set that records the number of logins each user makes per month. Some months are flagged for testing by external logic. For each flagged month, I want to calculate the mean of the logins over that month and the following 3 months (a 4-month window), without counting logins that belong to a different user.

Here is some sample code:

import pandas as pd

data = [
{ 'user': 'Mick', 'Month' : 4, 'logins': 5, 'test': True },
{ 'user': 'Mick', 'Month' : 5, 'logins': 4, 'test': False },
{ 'user': 'Mick', 'Month' : 6, 'logins': 7, 'test': False },
{ 'user': 'Mick', 'Month' : 7, 'logins': 5, 'test': True },
{ 'user': 'Mick', 'Month' : 8, 'logins': 12, 'test': False },
{ 'user': 'Mick', 'Month' : 9, 'logins': 8, 'test': False },
{ 'user': 'Tom', 'Month' : 4, 'logins': 15, 'test': True },
{ 'user': 'Tom', 'Month' : 5, 'logins': 24, 'test': False },
{ 'user': 'Tom', 'Month' : 6, 'logins': 17, 'test': False },
{ 'user': 'Tom', 'Month' : 7, 'logins': 22, 'test': False },
{ 'user': 'Tom', 'Month' : 8, 'logins': 20, 'test': False },
{ 'user': 'Tom', 'Month' : 9, 'logins': 18, 'test': True }
]

df = pd.DataFrame(data)

I came up with this solution, but it is wrong: it does not calculate the mean I need, and it does not stop the rolling window when the user changes.

df['mean'] = df[df['test'] == True]['logins'].rolling(4).mean()
print(df)
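
To see why the attempt above falls short, it helps to look at what it actually returns. The sketch below rebuilds only the logins and test columns from the sample data (the variable names logins, test and attempt are illustrative). Filtering first leaves just the four flagged rows, and rolling(4) then looks backward across them regardless of user, so only the last flagged row gets a value.

```python
import pandas as pd

# Same logins and test flags as the sample data, in the same row order
logins = pd.Series([5, 4, 7, 5, 12, 8, 15, 24, 17, 22, 20, 18])
test = pd.Series([True, False, False, True, False, False,
                  True, False, False, False, False, True])

# Filtering keeps rows 0, 3, 6 and 11 only; rolling(4) then needs four
# of those filtered rows, looking backward and mixing both users
attempt = logins[test].rolling(4).mean()
print(attempt)
# 0       NaN
# 3       NaN
# 6       NaN
# 11    10.75   <-- mean of 5, 5, 15 and 18, i.e. the flagged rows only
```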

My expected result looks like this. Please note that in row 3 the mean is calculated over only 3 rows, because the user changes after them:

    user  Month  logins   test  mean
0   Mick      4       5   True   5.25     <-- mean of this row and the next 3
1   Mick      5       4  False   NaN
2   Mick      6       7  False   NaN
3   Mick      7       5   True   8.3333   <-- mean of this row and the next 2, because the following rows belong to a different user
4   Mick      8      12  False   NaN
5   Mick      9       8  False   NaN
6    Tom      4      15   True   19.5     <-- mean of this row and the next 3
7    Tom      5      24  False   NaN
8    Tom      6      17  False   NaN
9    Tom      7      22  False   NaN
10   Tom      8      20  False   NaN
11   Tom      9      18   True   18       <-- mean of this row alone, because it is the last one

CodePudding user response:

You can do this by reversing the data, so that the usual backward-looking rolling window acts as a forward-looking one, then grouping by user and computing the rolling mean. After that, restore the original row order by sorting the index, and finally mask the rows where test is not True:

df['mean_logins'] = (
    df.iloc[::-1] # reverse order for forward rolling
      .groupby('user') # perform the operation per user
      .rolling(4, min_periods=1) # min_periods=1 to get a result even with fewer than 3 rows after
      ['logins'].mean()
      .reset_index(level=0, drop=True) # drop the group name
      .sort_index() # to get original order
      .where(df['test']) # mask the results for the False test
)
print(df)
#     user  Month  logins   test  mean_logins
# 0   Mick      4       5   True     5.250000
# 1   Mick      5       4  False          NaN
# 2   Mick      6       7  False          NaN
# 3   Mick      7       5   True     8.333333
# 4   Mick      8      12  False          NaN
# 5   Mick      9       8  False          NaN
# 6    Tom      4      15   True    19.500000
# 7    Tom      5      24  False          NaN
# 8    Tom      6      17  False          NaN
# 9    Tom      7      22  False          NaN
# 10   Tom      8      20  False          NaN
# 11   Tom      9      18   True    18.000000
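
As an alternative sketch that avoids reversing the data, pandas also provides a forward-looking window indexer, FixedForwardWindowIndexer, in pandas.api.indexers (this assumes pandas 1.1 or later). It is not the method used above, only a variant under that assumption. The sample data is rebuilt in compact form so the snippet runs on its own, and the names indexer and forward_mean are illustrative.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

df = pd.DataFrame({
    'user': ['Mick'] * 6 + ['Tom'] * 6,
    'Month': [4, 5, 6, 7, 8, 9] * 2,
    'logins': [5, 4, 7, 5, 12, 8, 15, 24, 17, 22, 20, 18],
    'test': [True, False, False, True, False, False,
             True, False, False, False, False, True],
})

# Window made of the current row and the next 3 rows
indexer = FixedForwardWindowIndexer(window_size=4)

# transform applies the rolling mean to each user's logins separately,
# so a window never reaches into the next user's rows
forward_mean = (
    df.groupby('user')['logins']
      .transform(lambda s: s.rolling(indexer, min_periods=1).mean())
)

# Keep the result only on the rows flagged for testing
df['mean_logins'] = forward_mean.where(df['test'])
print(df)
```

With min_periods=1, a window that runs past the end of a user's rows is averaged over whatever rows remain, which matches the shorter windows in the expected result.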

CodePudding user response:

Reverse the row order with DataFrame.iloc so that rolling looks forward, apply Series.rolling per group, and remove the first level of the resulting MultiIndex with Series.droplevel. Assigning through df.loc with the boolean mask then fills only the rows where test is True:

df.loc[df['test'], 'new'] = (df.iloc[::-1]    
                               .groupby('user')['logins']
                               .rolling(4, min_periods=1)
                               .mean()
                               .droplevel(0))
print(df)
    user  Month  logins   test        new
0   Mick      4       5   True   5.250000
1   Mick      5       4  False        NaN
2   Mick      6       7  False        NaN
3   Mick      7       5   True   8.333333
4   Mick      8      12  False        NaN
5   Mick      9       8  False        NaN
6    Tom      4      15   True  19.500000
7    Tom      5      24  False        NaN
8    Tom      6      17  False        NaN
9    Tom      7      22  False        NaN
10   Tom      8      20  False        NaN
11   Tom      9      18   True  18.000000
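
The assignment above works even though the rolling result comes back in reversed order, because pandas aligns the assigned Series on index labels rather than on position. A minimal sketch of that alignment, using a small made-up frame and made-up values (not the sample data):

```python
import pandas as pd

# Toy frame: rows 0 and 2 are flagged
df = pd.DataFrame({'logins': [5, 4, 7], 'test': [True, False, True]})

# Values deliberately listed in reversed index order
values = pd.Series([70.0, 40.0, 50.0], index=[2, 1, 0])

# Only the flagged rows are written, and each one receives the value
# carrying the same index label; the unflagged row stays NaN
df.loc[df['test'], 'new'] = values
print(df)
```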