How to groupby and make calculations on consecutive rows of the group?

For example, let's consider the following dataframe:

   Restaurant_ID Floor Cust_Arrival_Datetime
0       100        1    2021-11-17 17:20:00
1       100        1    2021-11-17 17:22:00
2       100        1    2021-11-17 17:25:00
3       100        1    2021-11-17 17:30:00
4       100        1    2021-11-17 17:50:00
5       100        1    2021-11-17 17:51:00
6       100        2    2021-11-17 17:25:00
7       100        2    2021-11-17 18:00:00
8       100        2    2021-11-17 18:50:00
9       100        2    2021-11-17 18:56:00

In the toy example above, Cust_Arrival_Datetime is already sorted within each restaurant and floor group. How can we calculate statistics such as the median time between consecutive customer arrivals for each unique restaurant and floor group?

The desired output would be:

   Restaurant_ID Floor  Median Arrival Interval(in minutes)
0       100        1    3
1       100        2    35

The Median Arrival Interval is calculated as follows: on the first floor of restaurant 100, the second customer arrives 2 minutes after the first. Similarly, 3 minutes elapse between the 2nd and 3rd customers, 5 minutes between the 3rd and 4th, and so on. The full set of intervals for that group is 2, 3, 5, 20 and 1 minutes, so the median for floor 1 of restaurant 100 is 3.
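To make the arithmetic concrete, here is a minimal sketch that rebuilds the toy data and lists the gaps between consecutive arrivals in each group. The variable names (gaps, floor1_gaps, floor2_gaps) are only illustrative and are not part of the original question.

```python
import pandas as pd

# Toy data copied from the table in the question.
df = pd.DataFrame({
    "Restaurant_ID": [100] * 10,
    "Floor": [1] * 6 + [2] * 4,
    "Cust_Arrival_Datetime": pd.to_datetime([
        "2021-11-17 17:20:00", "2021-11-17 17:22:00", "2021-11-17 17:25:00",
        "2021-11-17 17:30:00", "2021-11-17 17:50:00", "2021-11-17 17:51:00",
        "2021-11-17 17:25:00", "2021-11-17 18:00:00", "2021-11-17 18:50:00",
        "2021-11-17 18:56:00",
    ]),
})

# Gap to the previous arrival within the same restaurant/floor group, in minutes.
# The first row of each group has no predecessor, so its gap is NaN.
gaps = (
    df.groupby(["Restaurant_ID", "Floor"])["Cust_Arrival_Datetime"]
      .diff()
      .dt.total_seconds()
      .div(60)
)

floor1_gaps = gaps[df["Floor"] == 1].dropna().tolist()
floor2_gaps = gaps[df["Floor"] == 2].dropna().tolist()
print(floor1_gaps)  # [2.0, 3.0, 5.0, 20.0, 1.0]
print(floor2_gaps)  # [35.0, 50.0, 6.0]
```

Taking the median of each list gives 3 and 35, which matches the desired output above.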

I have tried something like this:

df.groupby(['Restaurant_ID', 'Floor'].apply(lambda row: row['Customer_Arrival_Datetime'].shift() - row['Customer_Arrival_Datetime']).apply(np.median)

but this does not work!

Any help is welcome!

CodePudding user response:

IIUC, you can do

(df.groupby(['Restaurant_ID', 'Floor'])['Cust_Arrival_Datetime']
   .agg(lambda x: x.diff().dt.total_seconds().median()/60))

and you get

Restaurant_ID  Floor
100            1         3.0
               2        35.0
Name: Cust_Arrival_Datetime, dtype: float64

You can chain reset_index() onto this if you need the group keys back as regular columns.
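As a rough sketch of that last step, the snippet below rebuilds the question's toy data and chains reset_index onto the aggregation. The name argument labels the value column; the label used here is copied from the question's desired output and can be anything you like.

```python
import pandas as pd

# Toy data copied from the question.
df = pd.DataFrame({
    "Restaurant_ID": [100] * 10,
    "Floor": [1] * 6 + [2] * 4,
    "Cust_Arrival_Datetime": pd.to_datetime([
        "2021-11-17 17:20:00", "2021-11-17 17:22:00", "2021-11-17 17:25:00",
        "2021-11-17 17:30:00", "2021-11-17 17:50:00", "2021-11-17 17:51:00",
        "2021-11-17 17:25:00", "2021-11-17 18:00:00", "2021-11-17 18:50:00",
        "2021-11-17 18:56:00",
    ]),
})

result = (
    df.groupby(["Restaurant_ID", "Floor"])["Cust_Arrival_Datetime"]
      # Median gap between consecutive arrivals, converted from seconds to minutes.
      .agg(lambda x: x.diff().dt.total_seconds().median() / 60)
      # Turn the group keys into columns and name the value column.
      .reset_index(name="Median Arrival Interval(in minutes)")
)
print(result)
```

The result is a DataFrame with one row per restaurant and floor, in the same shape as the desired output in the question.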

CodePudding user response:

Consider the following data frame:

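import pandas as pd

# Only time differences are used below, so the missing date part does not matter
df = pd.DataFrame({
  'group': [1,1,1,2,2,2],
  'time': pd.to_datetime(
    ['14:14', '14:17', '14:25', '17:29', '17:40','17:43']
  )
})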

Suppose you'd like to compute several statistics on the intervals between consecutive rows of each group:

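def stats(group):
  # Minutes between consecutive rows; the first row has no predecessor, so it is NaN
  diffs = group.diff().dt.total_seconds()/60
  # The pandas reducers below skip NaN by default
  return {
      'min': diffs.min(),
      'mean': diffs.mean(),
      'median': diffs.median(),
      'max': diffs.max()
  }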

Then aggregate each group with this function and expand the resulting dicts into columns:

>>> df.groupby('group')['time'].agg(stats).apply(pd.Series)

       min  mean  median   max
group                         
1      3.0   5.5     5.5   8.0
2      3.0   7.0     7.0  11.0
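As a possible alternative, the same table can be produced without a dict-returning function by computing the per-row gaps first and then aggregating them with pandas' built-in reducers. This is only a sketch: the names minutes and summary are illustrative, and an explicit date (borrowed from the question's data) is added to the timestamps so that parsing does not depend on the current day.

```python
import pandas as pd

df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2, 2],
    "time": pd.to_datetime([
        "2021-11-17 14:14", "2021-11-17 14:17", "2021-11-17 14:25",
        "2021-11-17 17:29", "2021-11-17 17:40", "2021-11-17 17:43",
    ]),
})

# Minutes since the previous row of the same group (NaN for each group's first row).
minutes = df.groupby("group")["time"].diff().dt.total_seconds() / 60

# Aggregate the per-row gaps per group; the built-in reducers skip NaN.
summary = minutes.groupby(df["group"]).agg(["min", "mean", "median", "max"])
print(summary)
```

Passing a list of reducer names to agg returns one column per statistic directly, so the apply(pd.Series) step is not needed in this variant.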