I have a dataframe like as shown below
sub_id,teacher,div,pid,pos_date
1,ABC,SCIENCE,A1,12/10/2021
1,ABC,SCIENCE,A1,22/06/2019
1,ABC,SCIENCE,A1,12/12/2018
1,ABC,SCIENCE,A1,27/11/2020
1,DEF,CHEMISTRY,A1,12/10/2021
1,DEF,CHEMISTRY,A2,11/11/2018
1,DEF,CHEMISTRY,A2,12/10/2021
1,ABC,SCIENCE,A2,12/10/2019
1,ABC,SCIENCE,A2,12/10/2020
1,ABC,SCIENCE,A3,12/11/2021
1,ABC,SCIENCE,A3,22/03/2022
1,ABC,SCIENCE,A4,22/10/2021
1,ABC,SCIENCE,A4,12/04/2021
df = pd.read_clipboard()
I would like to do the below
a) Group by sub_id,teacher,div and pid
b) For each group, compute the below
1) Max(pos_date)
2) Average gap between each pos_date
3) Median gap between each pos_date
4) No of records (count)
So, I tried the below
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].max()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].average()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].median()
df.set_index(['sub_id','teacher','div','pid']).groupby(['sub_id','teacher','div','pid'])['pos_date'].size()
but the above is neither elegant nor efficient.
Is there any better way to write the above code?
I expect my output to be like in below format
CodePudding user response:
Use groupby
and agg
:
# Ensure pos_date is DatetimeIndex
df['pos_date'] = pd.to_datetime(df['pos_date'], dayfirst=True)
out = (df.groupby(['sub_id','teacher','div','pid'])['pos_date']
.agg(['max', 'mean', 'median', 'size']))
Output:
>>> out
max mean median size
sub_id teacher div pid
1 ABC SCIENCE A1 2021-10-12 2020-04-10 18:00:00 2020-03-10 00:00:00 4
A2 2020-10-12 2020-04-12 00:00:00 2020-04-12 00:00:00 2
A3 2022-03-22 2022-01-16 00:00:00 2022-01-16 00:00:00 2
A4 2021-10-22 2021-07-17 12:00:00 2021-07-17 12:00:00 2
DEF CHEMISTRY A1 2021-10-12 2021-10-12 00:00:00 2021-10-12 00:00:00 1
A2 2021-10-12 2020-04-27 00:00:00 2020-04-27 00:00:00 2
Update
out = (df.groupby(['sub_id','teacher','div','pid'])['pos_date']
.agg(**{'Max': 'max', 'Mean': lambda x: x.diff().mean().days,
'Median': lambda x: x.diff().median(), 'Size': 'size'}))
Output:
>>> out
Max Mean Median Size
sub_id teacher div pid
1 ABC SCIENCE A1 2021-10-12 -107.0 -192 days 4
A2 2020-10-12 366.0 366 days 2
A3 2022-03-22 130.0 130 days 2
A4 2021-10-22 -193.0 -193 days 2
DEF CHEMISTRY A1 2021-10-12 NaN NaT 1
A2 2021-10-12 1066.0 1066 days 2