Home > database >  Compute avg gap between dates and max date for each group using pandas
Compute avg gap between dates and max date for each group using pandas

Time:07-04

I have a dataframe like as shown below

sub_id,teacher,div,pid,pos_date
 1,ABC,SCIENCE,A1,12/10/2021
 1,ABC,SCIENCE,A1,22/06/2019
 1,ABC,SCIENCE,A1,12/12/2018
 1,ABC,SCIENCE,A1,27/11/2020
 1,DEF,CHEMISTRY,A1,12/10/2021
 1,DEF,CHEMISTRY,A2,11/11/2018
 1,DEF,CHEMISTRY,A2,12/10/2021
 1,ABC,SCIENCE,A2,12/10/2019
 1,ABC,SCIENCE,A2,12/10/2020
 1,ABC,SCIENCE,A3,12/11/2021
 1,ABC,SCIENCE,A3,22/03/2022
 1,ABC,SCIENCE,A4,22/10/2021
 1,ABC,SCIENCE,A4,12/04/2021

 df = pd.read_clipboard()

I would like to do the below

a) Group by sub_id,teacher,div and pid

b) For each group, compute the below

       1)  Max(pos_date)
       2) Average gap between each pos_date
       3) Median gap between each pos_date
       4) No of records (count)

So, I tried the below

df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].max()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].average()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].median()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].size()

but the above is neither elegant nor efficient.

Is there any better way to write the above code?

I expect my output to be like in below format

enter image description here

CodePudding user response:

Use groupby and agg:

# Ensure pos_date is DatetimeIndex
df['pos_date'] = pd.to_datetime(df['pos_date'], dayfirst=True)

out = (df.groupby(['sub_id','teacher','div','pid'])['pos_date']
         .agg(['max', 'mean', 'median', 'size']))

Output:

>>> out
                                    max                mean              median  size
sub_id teacher div       pid                                                         
1      ABC     SCIENCE   A1  2021-10-12 2020-04-10 18:00:00 2020-03-10 00:00:00     4
                         A2  2020-10-12 2020-04-12 00:00:00 2020-04-12 00:00:00     2
                         A3  2022-03-22 2022-01-16 00:00:00 2022-01-16 00:00:00     2
                         A4  2021-10-22 2021-07-17 12:00:00 2021-07-17 12:00:00     2
       DEF     CHEMISTRY A1  2021-10-12 2021-10-12 00:00:00 2021-10-12 00:00:00     1
                         A2  2021-10-12 2020-04-27 00:00:00 2020-04-27 00:00:00     2

Update

out = (df.groupby(['sub_id','teacher','div','pid'])['pos_date']
         .agg(**{'Max': 'max', 'Mean': lambda x: x.diff().mean().days,
                 'Median': lambda x: x.diff().median(), 'Size': 'size'}))

Output:

>>> out
                                    Max    Mean    Median  Size
sub_id teacher div       pid                                   
1      ABC     SCIENCE   A1  2021-10-12  -107.0 -192 days     4
                         A2  2020-10-12   366.0  366 days     2
                         A3  2022-03-22   130.0  130 days     2
                         A4  2021-10-22  -193.0 -193 days     2
       DEF     CHEMISTRY A1  2021-10-12     NaN       NaT     1
                         A2  2021-10-12  1066.0 1066 days     2
  • Related