i'm scratching my head on this problem for a while now. I have a log with a case id, an activity id and the date in which the activity occurred. i need to add a column which will represent the end timestamp (DATEEND), but i don't know how to make the last date in the column DATEEND of each case id as the last date in the column DATESTART.
i was thinking about using groupby() and shift(-1), but then i don't know how to fix the last date. Any idea?
Input:
CASEID ACTIVITYID DATESTART
725/2022 _0 2022-03-09
725/2022 0F 2022-03-23
725/2022 YB 2022-06-21
725/2022 AA 2022-06-28
725/2022 1F 2022-06-28
725/2022 2B 2022-06-29
444/2022 AA 2022-07-26
444/2022 1F 2022-07-28
444/2022 2B 2022-07-28
Output:
CASEID ACTIVITYID DATESTART DATEEND
725/2022 _0 2022-03-09 2022-03-23
725/2022 0F 2022-03-23 2022-06-21
725/2022 YB 2022-06-21 2022-06-28
725/2022 AA 2022-06-28 2022-06-28
725/2022 1F 2022-06-28 2022-06-29
725/2022 2B 2022-06-29 2022-06-29
444/2022 AA 2022-07-26 2022-07-28
444/2022 1F 2022-07-28 2022-07-28
444/2022 2B 2022-07-28 2022-07-28
Thanks!
CodePudding user response:
Use Series.fillna
by column DATESTART
:
df['DATEEND'] = df.groupby('CASEID')['DATESTART'].shift(-1).fillna(df['DATESTART'])
print (df)
CASEID ACTIVITYID DATESTART DATEEND
0 725/2022 _0 2022-03-09 2022-03-23
1 725/2022 0F 2022-03-23 2022-06-21
2 725/2022 YB 2022-06-21 2022-06-28
3 725/2022 AA 2022-06-28 2022-06-28
4 725/2022 1F 2022-06-28 2022-06-29
5 725/2022 2B 2022-06-29 2022-06-29
6 444/2022 AA 2022-07-26 2022-07-28
7 444/2022 1F 2022-07-28 2022-07-28
8 444/2022 2B 2022-07-28 2022-07-28