Calculate average daily duration for a complete two step process-CodePudding

I have a data frame similar to the below. I would like to calculate the daily average time taken to go from step 1 to step 2 for a particular user. If a user restarts the process (logged as step 1), then the duration would be from calculated as the difference between this new step 1 and the subsequent step 2. Durations are calculated only using those records with both a step 1 and step 2 for the same user.

    user    action  time
0   3249    step 1  2021-10-25 19:45:43
1   2160    step 1  2021-10-25 19:48:46
2   2160    step 2  2021-10-25 19:50:21
3   3249    step 2  2021-10-25 19:57:34
4   5120    step 1  2021-10-25 20:30:56
5   3900    step 1  2021-10-25 20:35:40
6   3900    step 1  2021-10-25 20:50:59
7   3900    step 2  2021-10-25 21:15:08

CodePudding user response：

Use pivot_table to transform your dataframe:

# Use a clean dataframe
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values('time')

out = df.pivot_table(index=['user', df['time'].dt.date], columns='action',
                     values='time', aggfunc='last').reset_index()

out['daily_avg'] = out.groupby(['user', 'time'], as_index=False) \
                      .apply(lambda x: x['step 2'] - x['step 1']).values

out = out.groupby('user')['daily_avg'].mean().reset_index()

Output result:

>>> out
   user       daily_avg
0  2160 0 days 00:01:35
1  3249 0 days 00:11:51
2  3900 0 days 00:25:03
3  5120             NaT

My setup (I slightly modified your dataframe):

data = {'user': [3249, 2160, 2160, 3249, 5120, 3900, 3900, 3900, 3900, 3900], 'action': ['step 1', 'step 1', 'step 2', 'step 2', 'step 1', 'step 1', 'step 1', 'step 2', 'step 1', 'step 2'], 'time': ['2021-10-25 19:45:43', '2021-10-25 19:48:46', '2021-10-25 19:50:21', '2021-10-25 19:57:34', '2021-10-25 20:30:56', '2021-10-25 20:35:40', '2021-10-25 20:50:59', '2021-10-25 21:15:08', '2021-10-26 18:23:41', '2021-10-26 18:49:38']}

df = pd.DataFrame(data)
print(df)

# Output
   user  action                 time
0  3249  step 1  2021-10-25 19:45:43
1  2160  step 1  2021-10-25 19:48:46
2  2160  step 2  2021-10-25 19:50:21
3  3249  step 2  2021-10-25 19:57:34
4  5120  step 1  2021-10-25 20:30:56
5  3900  step 1  2021-10-25 20:35:40
6  3900  step 1  2021-10-25 20:50:59
7  3900  step 2  2021-10-25 21:15:08
8  3900  step 1  2021-10-26 18:23:41  # added
9  3900  step 2  2021-10-26 18:49:38  # added

CodePudding user response：

Ok, let me try:

(df
  .sort_values(by='time') # just to make sure
  .assign(day = lambda x: x['time'].dt.day)
  .groupby(['user', 'day']) # seems like a good place to start, also groupby preserves order.
  .apply(lambda x: (x
            .loc[lambda x: x != x.shift(1)] # Keeps only the last of step 1 (but allows for more than one process per day)
            .assign(diff = lambda x: x['time'] - x['time'].shift(1))
            ['diff'].mean()
        )
  )
)

I've not checked this, but give it a go. If I got it wrong it's probably that it should be .shift(-1)