I have a pandas dataframe ; sample shown below:
my_id Status my_date
434 A 2021-12-20
434 S1 2021-12-23
434 S1 2021-12-24
434 S1 2021-12-24
434 S1 2021-12-25
434 S2 2022-01-05
434 S2 2022-01-06
434 S2 2022-01-07
434 S2 2022-01-08
434 S2 2022-01-09
434 S2 2022-01-10
434 B 2022-01-11
434 S1 2022-10-20
434 S1 2022-10-24
434 S2 2022-10-25
980 S1 2021-12-20
341 S1 2021-08-18
387 S1 2022-03-15
625 S1 2022-02-16
625 S1 2022-02-22
625 S2 2022-02-25
905 S1 2022-03-14
905 S1 2022-03-16
905 S1 2022-03-18
I want to create three columns :
received_date : Looks at Status per my_id ; if the first status is "S1" ; then the received_date is equal to my_date; if the status change to any other status and changes back to "S1" then received_date again equals to my_date else NAN
accept_date : Looks at Status per my_id ; if the status is "S2" ; then the accept_date is equal to my_date; if the status change to any other status and changes back to "S2" then received_date again equals to my_date else NAN
end_date : If the status changes from "S1" or "S2" to any other status then end_date is equal to my_date else it equals to last my_date per my_id
Expected Output :
my_id Status my_date received_date Accept_date End_date
434 A 2021-12-20 NAN NAN NAN
434 S1 2021-12-23 2021-12-23 NAN NAN
434 S1 2021-12-24 NAN NAN NAN
434 S1 2021-12-24 NAN NAN NAN
434 S1 2021-12-25 NAN NAN NAN
434 S2 2022-01-05 NAN 2022-01-05 NAN
434 S2 2022-01-06 NAN NAN NAN
434 S2 2022-01-07 NAN NAN NAN
434 S2 2022-01-08 NAN NAN NAN
434 S2 2022-01-09 NAN NAN NAN
434 S2 2022-01-10 NAN NAN NAN
434 B 2022-01-11 NAN NAN 2022-01-11
434 S1 2022-10-20 2021-10-20 NAN NAN
434 S1 2022-10-24 NAN NAN NAN
434 S2 2022-10-25 NAN 2022-10-25 2022-10-25
980 S1 2021-12-20 2021-12-20 NAN 2021-12-20
341 S1 2021-08-18 2021-08-18 NAN 2021-08-18
387 S1 2022-03-15 2022-03-15 NAN 2022-03-15
625 S1 2022-02-16 2022-02-16 NAN NAN
625 S1 2022-02-22 NAN NAN NAN
625 S2 2022-02-25 NAN 2022-02-25 2022-02-25
905 S1 2022-03-14 2022-03-14 NAN NAN
905 S1 2022-03-16 NAN NAN NAN
905 S1 2022-03-18 NAN NAN 2022-03-18
CodePudding user response:
I don't get why B has a value in End_Date, apart from that, the rest is relatively easily done with boolean indexing:
m1 = df['Status'].str.match(r'S\d ')
m2 = df['Status'].ne(df.groupby('my_id')['Status'].shift())
m3 = (~m1)&m1.groupby(df['my_id']).shift()
out = (df
.join(df
.loc[m1&m2, ['Status', 'my_date']]
.set_index('Status', append=True)['my_date'].unstack()
.rename(columns={'S1': 'received_date', 'S2': 'Accept_date'})
)
.join(pd.concat([df.loc[m3, 'my_date'],
df.groupby('my_id')['my_date'].tail(1)]).rename('End_date')
)
)
output:
my_id Status my_date received_date Accept_date End_date
0 434 A 2021-12-20 NaN NaN NaN
1 434 S1 2021-12-23 2021-12-23 NaN NaN
2 434 S1 2021-12-24 NaN NaN NaN
3 434 S1 2021-12-24 NaN NaN NaN
4 434 S1 2021-12-25 NaN NaN NaN
5 434 S2 2022-01-05 NaN 2022-01-05 NaN
6 434 S2 2022-01-06 NaN NaN NaN
7 434 S2 2022-01-07 NaN NaN NaN
8 434 S2 2022-01-08 NaN NaN NaN
9 434 S2 2022-01-09 NaN NaN NaN
10 434 S2 2022-01-10 NaN NaN NaN
11 434 B 2022-01-11 NaN NaN 2022-01-11
12 434 S1 2022-10-20 2022-10-20 NaN NaN
13 434 S1 2022-10-24 NaN NaN NaN
14 434 S2 2022-10-25 NaN 2022-10-25 2022-10-25
15 980 S1 2021-12-20 2021-12-20 NaN 2021-12-20
16 341 S1 2021-08-18 2021-08-18 NaN 2021-08-18
17 387 S1 2022-03-15 2022-03-15 NaN 2022-03-15
18 625 S1 2022-02-16 2022-02-16 NaN NaN
19 625 S1 2022-02-22 NaN NaN NaN
20 625 S2 2022-02-25 NaN 2022-02-25 2022-02-25
21 905 S1 2022-03-14 2022-03-14 NaN NaN
22 905 S1 2022-03-16 NaN NaN NaN
23 905 S1 2022-03-18 NaN NaN 2022-03-18