```
import pandas as pd

data = {'id': ['1','1','1','1','2','2','2','2','2','3','3','3','3','3','3','3'],
        'status': ['Active','Active','Active','Pending Action','Pending Action','Pending Action','Active','Pending Action','Active','Draft','Active','Draft','Draft','Draft','Active','Draft'],
        'calc_date_id': ['05/07/2022','07/06/2022','31/08/2021','01/07/2021','20/11/2022','25/10/2022','02/04/2022','28/02/2022','01/07/2021','23/06/2022','15/06/2022','07/04/2022','09/11/2022','18/08/2020','19/03/2020','17/01/2020']
        }
df = pd.DataFrame(data)
# to datetime (the dates are day-first, e.g. 31/08/2021)
df['calc_date_id'] = pd.to_datetime(df['calc_date_id'], dayfirst=True)
```
How do I get the first date of the last time the status changed, per id?
I tried sorting by date, then a groupby on id and status with keep="first", but that didn't give what I wanted.
I also tried:
df_mt_date.loc[df_mt_date.groupby(['id', 'status'])['calc_date_id'].idxmin()]
Instead, I'd like to preserve the order by date and obtain only the first date at which each id changed status for the last time (not the whole history).
I'm running out of ideas; I'd appreciate any suggestion.
Thank you
CodePudding user response:
Try:
df["desired_output"] = df.groupby("id")["status"].transform(
    lambda x: df.loc[x.index, "calc_date_id"][(x != x.shift(-1)).idxmax()]
)
print(df)
Prints:
id status calc_date_id desired_output
0 1 Active 2022-07-05 2021-08-31
1 1 Active 2022-06-07 2021-08-31
2 1 Active 2021-08-31 2021-08-31
3 1 Pending Action 2021-07-01 2021-08-31
4 2 Pending Action 2022-11-20 2022-10-25
5 2 Pending Action 2022-10-25 2022-10-25
6 2 Active 2022-04-02 2022-10-25
7 2 Pending Action 2022-02-28 2022-10-25
8 2 Active 2021-07-01 2022-10-25
9 3 Draft 2022-06-23 2022-06-23
10 3 Active 2022-06-15 2022-06-23
11 3 Draft 2022-04-07 2022-06-23
12 3 Draft 2022-11-09 2022-06-23
13 3 Draft 2020-08-18 2022-06-23
14 3 Active 2020-03-19 2022-06-23
15 3 Draft 2020-01-17 2022-06-23
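For readers who find the shift/idxmax expression above dense: the same per-id semantics (the oldest date of the most recent status run, given rows sorted newest-first within each id) can also be sketched with an explicit run labelling via cumsum. This is a sketch on a small made-up sample, not the answer's exact code:

```python
import pandas as pd

# Small self-contained frame, sorted newest-first within each id,
# mirroring the structure of the question's data.
df = pd.DataFrame({
    "id": ["1", "1", "1", "2", "2", "2"],
    "status": ["Active", "Active", "Pending Action",
               "Pending Action", "Active", "Active"],
    "calc_date_id": pd.to_datetime(
        ["2022-07-05", "2021-08-31", "2021-07-01",
         "2022-11-20", "2022-04-02", "2021-07-01"]
    ),
})

def first_date_of_latest_run(grp):
    # Label consecutive runs of equal status; because rows are sorted
    # newest-first, the first label marks the most recent status run.
    run = (grp["status"] != grp["status"].shift()).cumsum()
    latest_run = grp[run == run.iloc[0]]
    # The oldest date inside that most recent run.
    return latest_run["calc_date_id"].min()

out = df.groupby("id")[["status", "calc_date_id"]].apply(first_date_of_latest_run)
print(out)
```

For id 1 this yields 2021-08-31 (the oldest of the three newest Active rows), matching the transform above on the corresponding rows.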
CodePudding user response:
From your desired output I see that the group "boundaries" are the points where a particular value of the status column occurs for the first time, regardless of the id column.
To mark the first occurrence of each value in the status column, run:
wrk = df.groupby('status', group_keys=False).apply(
    lambda grp: grp.assign(isFirst=grp.index[0] == grp.index))
wrk.isFirst = wrk.isFirst.cumsum()
To see the result, print wrk and look at isFirst column.
Then, to generate the result, run:
result = wrk.groupby('isFirst', group_keys=False).apply(
    lambda grp: grp.assign(desired_output=grp.calc_date_id.min()))\
    .drop(columns='isFirst')
Note the terminating drop, which removes the now-unnecessary isFirst column.
The result, for your data sample, is:
id status calc_date_id desired_output
0 1 Active 2022-07-05 2021-08-31
1 1 Active 2022-06-07 2021-08-31
2 1 Active 2021-08-31 2021-08-31
3 1 Pending Action 2021-07-01 2021-07-01
4 2 Pending Action 2022-11-20 2021-07-01
5 2 Pending Action 2022-10-25 2021-07-01
6 2 Active 2022-04-02 2021-07-01
7 2 Pending Action 2022-02-28 2021-07-01
8 2 Active 2021-07-01 2021-07-01
9 3 Draft 2022-06-23 2020-03-19
10 3 Active 2022-06-15 2020-03-19
11 3 Draft 2022-04-07 2020-03-19
12 3 Draft 2022-11-09 2020-03-19
13 3 Draft 2020-08-18 2020-03-19
14 3 Active 2020-03-19 2020-03-19
15 3 Draft 2022-01-17 2020-03-19
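If I understand the boundary logic above correctly, the same labels can be sketched without the groupby/apply round-trip, using duplicated: a status value's first occurrence is exactly a row that duplicated marks False, and a cumsum over those flags reproduces the group labels. A minimal sketch on a made-up sample:

```python
import pandas as pd

df = pd.DataFrame({
    "status": ["Active", "Active", "Pending Action",
               "Pending Action", "Active", "Draft"],
    "calc_date_id": pd.to_datetime(
        ["2022-07-05", "2021-08-31", "2021-07-01",
         "2022-11-20", "2022-04-02", "2022-06-23"]
    ),
})

# True on the first occurrence of each status value, anywhere in the frame.
is_first = ~df["status"].duplicated()
# Running count of boundaries seen so far = group label per row.
group_id = is_first.cumsum()
# Broadcast each group's minimum date back onto its rows.
df["desired_output"] = df.groupby(group_id)["calc_date_id"].transform("min")
print(df)
```

Here rows 0-1 form one group (the second Active row is not a first occurrence), rows 2-4 another, and the lone Draft row a third, so each row receives the minimum date of its group.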