I have a dataset like this.
- I would like to calculate the mean for each user based on previous events for the user.
- Calculate the cumulative number of events for each user
- Calculate the number of days since the last event for a specific user.
from datetime import date
import pandas as pd
Sales = [1000, 1450, 1390, 1778, 1966, 1100, 700, 900]
User_ID = [50, 52, 50, 53,53, 52, 52, 52]
begin_date = '2022-10-16 10:12:00'
# initialize data of lists.
data = {'date':pd.date_range(begin_date, periods=len(Sales)),
'User_ID': User_ID,
'Sales' : Sales}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
Input
Output
CodePudding user response:
you can get cumulative average by expanding
g = df.groupby('User_ID')
g['Sales'].transform(lambda x: x.expanding().mean())
you can get time gap following code:
s = g['date'].transform(lambda x: x.sub(x.shift(1)))
s
0 NaT
1 NaT
2 2 days
3 NaT
4 1 days
5 4 days
6 1 days
7 1 days
Name: date, dtype: timedelta64[ns]
you can get cumcount
g.cumcount() 1
make desired output
df.assign(mean= g['Sales'].transform(lambda x: x.expanding().mean()),
Time_Elapsed=s.div(pd.Timedelta('1day')).fillna(0).astype('int'),
No_of_event=g.cumcount() 1)
result:
date User_ID Sales mean Time_Elapsed No_of_event
0 2022-10-16 10:12:00 50 1000 1000.0 0 1
1 2022-10-17 10:12:00 52 1450 1450.0 0 1
2 2022-10-18 10:12:00 50 1390 1195.0 2 2
3 2022-10-19 10:12:00 53 1778 1778.0 0 1
4 2022-10-20 10:12:00 53 1966 1872.0 1 2
5 2022-10-21 10:12:00 52 1100 1275.0 4 2
6 2022-10-22 10:12:00 52 700 1083.3 1 3
7 2022-10-23 10:12:00 52 900 1037.5 1 4