I have a data set that contains the number of logins each user makes in each month. Some months are flagged for testing by external logic. For each flagged month, I want to calculate the mean of the logins over a 4-month window (that month and the next 3), without counting logins from a different user.
Here is some sample code:
import pandas as pd
data = [
{ 'user': 'Mick', 'Month' : 4, 'logins': 5, 'test': True },
{ 'user': 'Mick', 'Month' : 5, 'logins': 4, 'test': False },
{ 'user': 'Mick', 'Month' : 6, 'logins': 7, 'test': False },
{ 'user': 'Mick', 'Month' : 7, 'logins': 5, 'test': True },
{ 'user': 'Mick', 'Month' : 8, 'logins': 12, 'test': False },
{ 'user': 'Mick', 'Month' : 9, 'logins': 8, 'test': False },
{ 'user': 'Tom', 'Month' : 4, 'logins': 15, 'test': True },
{ 'user': 'Tom', 'Month' : 5, 'logins': 24, 'test': False },
{ 'user': 'Tom', 'Month' : 6, 'logins': 17, 'test': False },
{ 'user': 'Tom', 'Month' : 7, 'logins': 22, 'test': False },
{ 'user': 'Tom', 'Month' : 8, 'logins': 20, 'test': False },
{ 'user': 'Tom', 'Month' : 9, 'logins': 18, 'test': True }
]
df = pd.DataFrame(data)
I came up with this solution, but it is wrong: it does not calculate the right mean, and it does not stop the rolling window when the user changes.
df['mean'] = df[df['test'] == True]['logins'].rolling(4).mean()
print(df)
My expected result is like this. Please note: in row 3, the mean is calculated only on 3 rows, because of the change of user:
user Month logins test mean
0 Mick 4 5 True 5.25 <-- mean of this row and the next 3
1 Mick 5 4 False NaN
2 Mick 6 7 False NaN
3 Mick 7 5 True 8.3333 <-- mean of this row and the next 2, because the user changes after that
4 Mick 8 12 False NaN
5 Mick 9 8 False NaN
6 Tom 4 15 True 19.5 <-- mean of this row and the next 3
7 Tom 5 24 False NaN
8 Tom 6 17 False NaN
9 Tom 7 22 False NaN
10 Tom 8 20 False NaN
11 Tom 9 18 True 18 <-- mean of this row alone, because it is the last one
CodePudding user response:
You can do this by reversing the data, so that an ordinary backward-looking rolling window acts as a forward one, then grouping by user and taking the rolling mean. Afterwards, restore the original order by sorting the index, and finally mask the rows where test is not True:
df['mean_logins'] = (
df.iloc[::-1] # reverse order for forward rolling
.groupby('user') # perform the operation per user
.rolling(4, min_periods=1) # min_periods=1 to get a result even with fewer than 3 rows after
['logins'].mean()
.reset_index(level=0, drop=True) # drop the group name
.sort_index() # to get original order
.where(df['test']) # mask the results for the False test
)
print(df)
# user Month logins test mean_logins
# 0 Mick 4 5 True 5.250000
# 1 Mick 5 4 False NaN
# 2 Mick 6 7 False NaN
# 3 Mick 7 5 True 8.333333
# 4 Mick 8 12 False NaN
# 5 Mick 9 8 False NaN
# 6 Tom 4 15 True 19.500000
# 7 Tom 5 24 False NaN
# 8 Tom 6 17 False NaN
# 9 Tom 7 22 False NaN
# 10 Tom 8 20 False NaN
# 11 Tom 9 18 True 18.000000
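As a possible alternative to reversing the rows, pandas also provides a forward-looking window indexer. The following is a minimal sketch, assuming a pandas version that includes pandas.api.indexers.FixedForwardWindowIndexer (1.1 or later) and that the rows are already sorted by Month within each user. The column name mean_fwd is only illustrative.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# Same sample data as in the question, written compactly.
df = pd.DataFrame({
    'user': ['Mick'] * 6 + ['Tom'] * 6,
    'Month': [4, 5, 6, 7, 8, 9] * 2,
    'logins': [5, 4, 7, 5, 12, 8, 15, 24, 17, 22, 20, 18],
    'test': [True, False, False, True, False, False,
             True, False, False, False, False, True],
})

# Window covering the current row and the next 3 rows.
indexer = FixedForwardWindowIndexer(window_size=4)

# Apply the forward rolling mean separately to each user's logins;
# transform keeps the original index, so no re-sorting is needed.
forward_mean = (
    df.groupby('user')['logins']
      .transform(lambda s: s.rolling(indexer, min_periods=1).mean())
)

# Keep the value only on the rows flagged for testing.
df['mean_fwd'] = forward_mean.where(df['test'])
print(df)
```

Because the rolling is applied inside each group, the window is cut short at the end of a user's rows instead of running into the next user.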
CodePudding user response:
Reverse the row order with DataFrame.iloc to get forward rolling, use Series.rolling per group, remove the first level of the MultiIndex with Series.droplevel, and assign the result only to the rows where test is True:
df.loc[df['test'], 'new'] = (df.iloc[::-1]
.groupby('user')['logins']
.rolling(4, min_periods=1)
.mean()
.droplevel(0))
print(df)
user Month logins test new
0 Mick 4 5 True 5.250000
1 Mick 5 4 False NaN
2 Mick 6 7 False NaN
3 Mick 7 5 True 8.333333
4 Mick 8 12 False NaN
5 Mick 9 8 False NaN
6 Tom 4 15 True 19.500000
7 Tom 5 24 False NaN
8 Tom 6 17 False NaN
9 Tom 7 22 False NaN
10 Tom 8 20 False NaN
11 Tom 9 18 True 18.000000
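To sanity-check results like these, the forward means can also be computed by brute force, one flagged row at a time. This is only a sketch for verification, not a replacement for the vectorised versions; forward_mean is a hypothetical helper name, and it assumes the rows are sorted by Month within each user and that the index is unique.

```python
import pandas as pd

# Same sample data as in the question, written compactly.
df = pd.DataFrame({
    'user': ['Mick'] * 6 + ['Tom'] * 6,
    'Month': [4, 5, 6, 7, 8, 9] * 2,
    'logins': [5, 4, 7, 5, 12, 8, 15, 24, 17, 22, 20, 18],
    'test': [True, False, False, True, False, False,
             True, False, False, False, False, True],
})


def forward_mean(frame, row, window=4):
    """Mean of logins for `row` and up to `window - 1` following rows
    of the same user (hypothetical helper, for checking only)."""
    same_user = frame[frame['user'] == frame.loc[row, 'user']]
    pos = same_user.index.get_loc(row)  # position of the row within its user
    return same_user['logins'].iloc[pos:pos + window].mean()


# One value per row flagged with test == True.
expected = {int(row): forward_mean(df, row) for row in df.index[df['test']]}
print(expected)
```

The resulting values can then be compared against the non-NaN entries of the column produced by the rolling approach.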