I have a data frame of users' activity, with a week number and a timestamp within that week for each row. I'm trying to calculate the average difference in activity time for each user. This is an example of the data:
ID | week | date |
---|---|---|
1 | 1 | 20/07/22 07:01:00 |
1 | 1 | 20/07/22 07:02:00 |
1 | 1 | 20/07/22 09:02:00 |
1 | 1 | 20/07/22 09:03:00 |
1 | 1 | 22/07/22 10:00:00 |
1 | 1 | 22/07/22 10:02:00 |
1 | 2 | 28/07/22 11:03:00 |
1 | 2 | 28/07/22 11:04:00 |
1 | 2 | 28/07/22 11:05:00 |
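For reference, the sample above can be rebuilt with something like this (a minimal sketch; the dates are assumed to be day-first):

import pandas as pd

df1 = pd.DataFrame({
    'ID': [1] * 9,
    'week': [1] * 6 + [2] * 3,
    'date': ['20/07/22 07:01:00', '20/07/22 07:02:00', '20/07/22 09:02:00',
             '20/07/22 09:03:00', '22/07/22 10:00:00', '22/07/22 10:02:00',
             '28/07/22 11:03:00', '28/07/22 11:04:00', '28/07/22 11:05:00'],
})
df1['date'] = pd.to_datetime(df1['date'], dayfirst=True)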
So far this is what I've managed to do:
ID | week | date | avg_difference |
---|---|---|---|
1 | 1 | 20/07/22 | 40.666 |
1 | 1 | 22/07/22 | 2 |
1 | 2 | 28/07/22 | 1 |
My issue is that the average for 20/07/22 is biased, because users sometimes visit the site multiple times a day, at hours far apart.
I would like to average only the differences between timestamps that are close together, within the range of the same hour, so the output will look something like this:
ID | week | date | avg_difference |
---|---|---|---|
1 | 1 | 20/07/22 | 1 |
1 | 1 | 20/07/22 | 1 |
1 | 1 | 22/07/22 | 2 |
1 | 2 | 28/07/22 | 1 |
The code I used:
import pandas as pd

df1 = pd.read_csv(r'C:\Users\Documents\Python\Data03.csv')
df1["date"] = pd.to_datetime(df1["date"], dayfirst=True)  # dates are day-first
day = df1['date'].dt.normalize()  # calendar day, for grouping
xx = (df1.sort_values('date')
         .groupby(['ID', 'week', day])['date']
         .agg(lambda x: x.diff().mean())
         .dt.total_seconds() / 60)  # mean gap per day, in minutes
xx_df = xx.to_frame('avg_difference').reset_index().fillna(0)
Would appreciate your help and insights!
Thanks!!
CodePudding user response:
If the difference between consecutive timestamps is above a threshold, e.g. 3600 seconds, create separate visit groups and use them to compute the final mean of the differences:
df1["date"] = pd.to_datetime(df1["date"], dayfirst=True)
df1 = df1.sort_values('date')
threshold = 3600
groups = (df1.groupby(['ID', 'week', pd.Grouper(freq='d', key='date')])['date']
.transform(lambda x: x.diff().dt.total_seconds().gt(threshold).cumsum()))
out = (df1.groupby(['ID', 'week',pd.Grouper(freq='d', key='date'), groups])['date']
.agg(lambda x: x.diff().mean())
.dt.total_seconds()
.div(60)
.droplevel(-1)
.reset_index(name='avg_difference'))
print (out)
ID week date avg_difference
0 1 1 2022-07-20 1.0
1 1 1 2022-07-20 1.0
2 1 1 2022-07-22 2.0
3 1 2 2022-07-28 1.0
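If you want to see how the visit ids come out, here is a quick inspection sketch (it assumes df1 and groups from the snippet above; the check, gap_seconds and visit names are just for illustration):

check = df1.assign(
    gap_seconds=df1.groupby(['ID', 'week', pd.Grouper(freq='d', key='date')])['date']
                   .diff().dt.total_seconds(),
    visit=groups,
)
print(check[['date', 'gap_seconds', 'visit']])
# On 20/07 the 7200-second gap between 07:02 and 09:02 bumps the counter,
# so the two morning rows get visit 0 and the two later rows get visit 1.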
CodePudding user response:
With a slight modification of your code, you can discard the deltas above a given threshold and still keep a single aggregation per day:
thresh = '1h'
day = df1['date'].dt.normalize()
# Keep only the sub-threshold gaps before taking the mean
xx = (df1.sort_values('date')
         .groupby(['ID', 'week', day])['date']
         .agg(lambda x: x.diff().loc[lambda g: g.lt(thresh)].mean())
         .dt.total_seconds() / 60)
xx_df = xx.to_frame('avg_difference').reset_index().fillna(0)
Output:
ID week date avg_difference
0 1 1 2022-07-20 1.0
1 1 1 2022-07-22 2.0
2 1 2 2022-07-28 1.0
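As a quick sanity check (a self-contained sketch, not part of the pipeline), the inner .loc keeps only the sub-hour gaps before averaging, which is what removes the bias on 20/07/22:

import pandas as pd

# The gaps on 20/07/22 were 1 min, 120 min and 1 min
s = pd.Series(pd.to_timedelta(['1min', '120min', '1min']))
kept = s.loc[lambda g: g.lt('1h')]  # the 2-hour gap is dropped
print(kept.mean())                  # 0 days 00:01:00, i.e. 1.0 minute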
If you want to separate the visits within a day:
thresh = '1h'
df1 = df1.sort_values('date')
# Gap between consecutive timestamps within each (ID, week, day)
diff = df1.groupby(['ID', 'week', day])['date'].diff()
m = diff.gt(thresh)  # True where a gap opens a new visit
# Drop the rows that open a new visit, then average the gaps per visit
(df1.assign(diff=diff)[~m]
    .groupby(['ID', 'week', day, m.cumsum().rename('n')])['diff']
    .mean().droplevel('n').reset_index()
)
Output:
ID week date diff
0 1 1 2022-07-20 0 days 00:01:00
1 1 1 2022-07-20 0 days 00:01:00
2 1 1 2022-07-22 0 days 00:02:00
3 1 2 2022-07-28 0 days 00:01:00
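If you prefer minutes like in your expected output, the same pipeline needs just one extra conversion step (a sketch reusing day, diff and m from the snippet above; the out name is only for illustration):

out = (df1.assign(diff=diff)[~m]
          .groupby(['ID', 'week', day, m.cumsum().rename('n')])['diff']
          .mean()
          .dt.total_seconds().div(60)
          .droplevel('n')
          .reset_index(name='avg_difference'))
print(out)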