I have a data frame of users' activity, with a week number and a timestamp within that week for each row. I'm trying to calculate the average difference in activity time for each user. This is an example of the data:
ID | week | date |
---|---|---|
1 | 1 | 20/07/22 07:01:00 |
1 | 1 | 20/07/22 07:02:00 |
1 | 1 | 20/07/22 09:02:00 |
1 | 1 | 20/07/22 09:03:00 |
1 | 1 | 22/07/22 10:00:00 |
1 | 1 | 22/07/22 10:02:00 |
1 | 2 | 28/07/22 11:03:00 |
1 | 2 | 28/07/22 11:04:00 |
1 | 2 | 28/07/22 11:05:00 |
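For reference, the sample above can be rebuilt with something like this (a minimal sketch; the dates are assumed to be day-first):

import pandas as pd

df1 = pd.DataFrame({
    'ID': [1] * 9,
    'week': [1] * 6 + [2] * 3,
    'date': ['20/07/22 07:01:00', '20/07/22 07:02:00', '20/07/22 09:02:00',
             '20/07/22 09:03:00', '22/07/22 10:00:00', '22/07/22 10:02:00',
             '28/07/22 11:03:00', '28/07/22 11:04:00', '28/07/22 11:05:00'],
})
df1['date'] = pd.to_datetime(df1['date'], dayfirst=True)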
So far this is what I've managed to do:
ID | week | date | avg_difference |
---|---|---|---|
1 | 1 | 20/07/22 | 40.666 |
1 | 1 | 22/07/22 | 2 |
1 | 2 | 28/07/22 | 1 |
My issue is that the average for 20/07/22 is biased, because users sometimes visit the site multiple times a day, at hours far apart.
I would like to average only the differences between timestamps that are close together, within the range of the same hour, so the output will look something like this:
ID | week | date | avg_difference |
---|---|---|---|
1 | 1 | 20/07/22 | 1 |
1 | 1 | 20/07/22 | 1 |
1 | 1 | 22/07/22 | 2 |
1 | 2 | 28/07/22 | 1 |
The code I used:
import pandas as pd

df1 = pd.read_csv(r'C:\Users\Documents\Python\Data03.csv')
df1["date"] = pd.to_datetime(df1["date"], dayfirst=True)  # dates are day-first
day = df1['date'].dt.normalize()  # calendar day, for grouping
xx = (df1.sort_values('date')
         .groupby(['ID', 'week', day])['date']
         .agg(lambda x: x.diff().mean())
         .dt.total_seconds() / 60)  # mean gap per day, in minutes
xx_df = xx.to_frame('avg_difference').reset_index().fillna(0)
Would appreciate your help and insights!
Thanks!!
CodePudding user response:
If the difference between consecutive timestamps is above a threshold, e.g. 3600 seconds, create separate visit groups and use them to compute the final mean of the differences:
df1["date"] = pd.to_datetime(df1["date"], dayfirst=True)
df1 = df1.sort_values('date')
threshold = 3600
groups = (df1.groupby(['ID', 'week', pd.Grouper(freq='d', key='date')])['date']
.transform(lambda x: x.diff().dt.total_seconds().gt(threshold).cumsum()))
out = (df1.groupby(['ID', 'week',pd.Grouper(freq='d', key='date'), groups])['date']
.agg(lambda x: x.diff().mean())
.dt.total_seconds()
.div(60)
.droplevel(-1)
.reset_index(name='avg_difference'))
print (out)
ID week date avg_difference
0 1 1 2022-07-20 1.0
1 1 1 2022-07-20 1.0
2 1 1 2022-07-22 2.0
3 1 2 2022-07-28 1.0
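If you want to see how the visit ids come out, here is a quick inspection sketch (it assumes df1 and groups from the snippet above; the check, gap_seconds and visit names are just for illustration):

check = df1.assign(
    gap_seconds=df1.groupby(['ID', 'week', pd.Grouper(freq='d', key='date')])['date']
                   .diff().dt.total_seconds(),
    visit=groups,
)
print(check[['date', 'gap_seconds', 'visit']])
# On 20/07 the 7200-second gap between 07:02 and 09:02 bumps the counter,
# so the two morning rows get visit 0 and the two later rows get visit 1.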
CodePudding user response:
With a slight modification of your code, you can discard the deltas above a given threshold and still keep a single aggregation per day:
thresh = '1h'
day = df1['date'].dt.normalize()
# Keep only the sub-threshold gaps before taking the mean
xx = (df1.sort_values('date')
         .groupby(['ID', 'week', day])['date']
         .agg(lambda x: x.diff().loc[lambda g: g.lt(thresh)].mean())
         .dt.total_seconds() / 60)
xx_df = xx.to_frame('avg_difference').reset_index().fillna(0)
Output:
ID week date avg_difference
0 1 1 2022-07-20 1.0
1 1 1 2022-07-22 2.0
2 1 2 2022-07-28 1.0
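As a quick sanity check (a self-contained sketch, not part of the pipeline), the inner .loc keeps only the sub-hour gaps before averaging, which is what removes the bias on 20/07/22:

import pandas as pd

# The gaps on 20/07/22 were 1 min, 120 min and 1 min
s = pd.Series(pd.to_timedelta(['1min', '120min', '1min']))
kept = s.loc[lambda g: g.lt('1h')]  # the 2-hour gap is dropped
print(kept.mean())                  # 0 days 00:01:00, i.e. 1.0 minute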
If you want to separate the visits within a day:
thresh = '1h'
df1 = df1.sort_values('date')
# Gap between consecutive timestamps within each (ID, week, day)
diff = df1.groupby(['ID', 'week', day])['date'].diff()
m = diff.gt(thresh)  # True where a gap opens a new visit
# Drop the rows that open a new visit, then average the gaps per visit
(df1.assign(diff=diff)[~m]
    .groupby(['ID', 'week', day, m.cumsum().rename('n')])['diff']
    .mean().droplevel('n').reset_index()
)
Output:
ID week date diff
0 1 1 2022-07-20 0 days 00:01:00
1 1 1 2022-07-20 0 days 00:01:00
2 1 1 2022-07-22 0 days 00:02:00
3 1 2 2022-07-28 0 days 00:01:00
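If you prefer minutes like in your expected output, the same pipeline needs just one extra conversion step (a sketch reusing day, diff and m from the snippet above; the out name is only for illustration):

out = (df1.assign(diff=diff)[~m]
          .groupby(['ID', 'week', day, m.cumsum().rename('n')])['diff']
          .mean()
          .dt.total_seconds().div(60)
          .droplevel('n')
          .reset_index(name='avg_difference'))
print(out)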