I have a data frame similar to the below. I would like to calculate the daily average time taken to go from step 1 to step 2 for a particular user. If a user restarts the process (logged as step 1), then the duration would be from calculated as the difference between this new step 1 and the subsequent step 2. Durations are calculated only using those records with both a step 1 and step 2 for the same user.
user action time
0 3249 step 1 2021-10-25 19:45:43
1 2160 step 1 2021-10-25 19:48:46
2 2160 step 2 2021-10-25 19:50:21
3 3249 step 2 2021-10-25 19:57:34
4 5120 step 1 2021-10-25 20:30:56
5 3900 step 1 2021-10-25 20:35:40
6 3900 step 1 2021-10-25 20:50:59
7 3900 step 2 2021-10-25 21:15:08
CodePudding user response:
Use pivot_table
to transform your dataframe:
# Use a clean dataframe
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values('time')
out = df.pivot_table(index=['user', df['time'].dt.date], columns='action',
values='time', aggfunc='last').reset_index()
out['daily_avg'] = out.groupby(['user', 'time'], as_index=False) \
.apply(lambda x: x['step 2'] - x['step 1']).values
out = out.groupby('user')['daily_avg'].mean().reset_index()
Output result:
>>> out
user daily_avg
0 2160 0 days 00:01:35
1 3249 0 days 00:11:51
2 3900 0 days 00:25:03
3 5120 NaT
My setup (I slightly modified your dataframe):
data = {'user': [3249, 2160, 2160, 3249, 5120, 3900, 3900, 3900, 3900, 3900], 'action': ['step 1', 'step 1', 'step 2', 'step 2', 'step 1', 'step 1', 'step 1', 'step 2', 'step 1', 'step 2'], 'time': ['2021-10-25 19:45:43', '2021-10-25 19:48:46', '2021-10-25 19:50:21', '2021-10-25 19:57:34', '2021-10-25 20:30:56', '2021-10-25 20:35:40', '2021-10-25 20:50:59', '2021-10-25 21:15:08', '2021-10-26 18:23:41', '2021-10-26 18:49:38']}
df = pd.DataFrame(data)
print(df)
# Output
user action time
0 3249 step 1 2021-10-25 19:45:43
1 2160 step 1 2021-10-25 19:48:46
2 2160 step 2 2021-10-25 19:50:21
3 3249 step 2 2021-10-25 19:57:34
4 5120 step 1 2021-10-25 20:30:56
5 3900 step 1 2021-10-25 20:35:40
6 3900 step 1 2021-10-25 20:50:59
7 3900 step 2 2021-10-25 21:15:08
8 3900 step 1 2021-10-26 18:23:41 # added
9 3900 step 2 2021-10-26 18:49:38 # added
CodePudding user response:
Ok, let me try:
(df
.sort_values(by='time') # just to make sure
.assign(day = lambda x: x['time'].dt.day)
.groupby(['user', 'day']) # seems like a good place to start, also groupby preserves order.
.apply(lambda x: (x
.loc[lambda x: x != x.shift(1)] # Keeps only the last of step 1 (but allows for more than one process per day)
.assign(diff = lambda x: x['time'] - x['time'].shift(1))
['diff'].mean()
)
)
)
I've not checked this, but give it a go.
If I got it wrong it's probably that it should be .shift(-1)