```
import pandas as pd

data = {'id': ['1','1','1','1','2','2','2','2','2','3','3','3','3','3','3','3'],
        'status': ['Active','Active','Active','Pending Action','Pending Action','Pending Action','Active','Pending Action','Active','Draft','Active','Draft','Draft','Draft','Active','Draft'],
        'calc_date_id': ['05/07/2022','07/06/2022','31/08/2021','01/07/2021','20/11/2022','25/10/2022','02/04/2022','28/02/2022','01/07/2021','23/06/2022','15/06/2022','07/04/2022','09/11/2022','18/08/2020','19/03/2020','17/01/2020']
        }
df = pd.DataFrame(data)
# to datetime (the dates are day-first, e.g. 31/08/2021)
df['calc_date_id'] = pd.to_datetime(df['calc_date_id'], dayfirst=True)
```
How do I get the first date of the last time the status changed, per id?
I tried sorting by date, then a groupby on id and status with keep="first", but that didn't give what I wanted.
I also tried:
df_mt_date.loc[df_mt_date.groupby(['id', 'status'])['calc_date_id'].idxmin()]
Instead, I'd like to preserve the order by date and obtain only the first date at which each id changed status for the last time (not the whole history).
I'm running out of ideas; I'd appreciate any suggestion.
Thank you
CodePudding user response:
Try:
df["desired_output"] = df.groupby("id")["status"].transform(
    lambda x: df.loc[x.index, "calc_date_id"][(x != x.shift(-1)).idxmax()]
)
print(df)
Prints:
id status calc_date_id desired_output
0 1 Active 2022-07-05 2021-08-31
1 1 Active 2022-06-07 2021-08-31
2 1 Active 2021-08-31 2021-08-31
3 1 Pending Action 2021-07-01 2021-08-31
4 2 Pending Action 2022-11-20 2022-10-25
5 2 Pending Action 2022-10-25 2022-10-25
6 2 Active 2022-04-02 2022-10-25
7 2 Pending Action 2022-02-28 2022-10-25
8 2 Active 2021-07-01 2022-10-25
9 3 Draft 2022-06-23 2022-06-23
10 3 Active 2022-06-15 2022-06-23
11 3 Draft 2022-04-07 2022-06-23
12 3 Draft 2022-11-09 2022-06-23
13 3 Draft 2020-08-18 2022-06-23
14 3 Active 2020-03-19 2022-06-23
15 3 Draft 2020-01-17 2022-06-23
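For readers who find the shift/idxmax expression above dense: the same per-id semantics (the oldest date of the most recent status run, given rows sorted newest-first within each id) can also be sketched with an explicit run labelling via cumsum. This is a sketch on a small made-up sample, not the answer's exact code:

```python
import pandas as pd

# Small self-contained frame, sorted newest-first within each id,
# mirroring the structure of the question's data.
df = pd.DataFrame({
    "id": ["1", "1", "1", "2", "2", "2"],
    "status": ["Active", "Active", "Pending Action",
               "Pending Action", "Active", "Active"],
    "calc_date_id": pd.to_datetime(
        ["2022-07-05", "2021-08-31", "2021-07-01",
         "2022-11-20", "2022-04-02", "2021-07-01"]
    ),
})

def first_date_of_latest_run(grp):
    # Label consecutive runs of equal status; because rows are sorted
    # newest-first, the first label marks the most recent status run.
    run = (grp["status"] != grp["status"].shift()).cumsum()
    latest_run = grp[run == run.iloc[0]]
    # The oldest date inside that most recent run.
    return latest_run["calc_date_id"].min()

out = df.groupby("id")[["status", "calc_date_id"]].apply(first_date_of_latest_run)
print(out)
```

For id 1 this yields 2021-08-31 (the oldest of the three newest Active rows), matching the transform above on the corresponding rows.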
CodePudding user response:
From your desired output I see that the group "boundaries" are the points where a particular value of the status column occurs for the first time, regardless of the id column.
To mark the first occurrence of each value in the status column, run:
wrk = df.groupby('status', group_keys=False).apply(
    lambda grp: grp.assign(isFirst=grp.index[0] == grp.index))
wrk.isFirst = wrk.isFirst.cumsum()
To see the result, print wrk and look at isFirst column.
Then, to generate the result, run:
result = wrk.groupby('isFirst', group_keys=False).apply(
    lambda grp: grp.assign(desired_output=grp.calc_date_id.min()))\
    .drop(columns='isFirst')
Note the terminating drop, which removes the now-unnecessary isFirst column.
The result, for your data sample, is:
id status calc_date_id desired_output
0 1 Active 2022-07-05 2021-08-31
1 1 Active 2022-06-07 2021-08-31
2 1 Active 2021-08-31 2021-08-31
3 1 Pending Action 2021-07-01 2021-07-01
4 2 Pending Action 2022-11-20 2021-07-01
5 2 Pending Action 2022-10-25 2021-07-01
6 2 Active 2022-04-02 2021-07-01
7 2 Pending Action 2022-02-28 2021-07-01
8 2 Active 2021-07-01 2021-07-01
9 3 Draft 2022-06-23 2020-03-19
10 3 Active 2022-06-15 2020-03-19
11 3 Draft 2022-04-07 2020-03-19
12 3 Draft 2022-11-09 2020-03-19
13 3 Draft 2020-08-18 2020-03-19
14 3 Active 2020-03-19 2020-03-19
15 3 Draft 2022-01-17 2020-03-19
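If I understand the boundary logic above correctly, the same labels can be sketched without the groupby/apply round-trip, using duplicated: a status value's first occurrence is exactly a row that duplicated marks False, and a cumsum over those flags reproduces the group labels. A minimal sketch on a made-up sample:

```python
import pandas as pd

df = pd.DataFrame({
    "status": ["Active", "Active", "Pending Action",
               "Pending Action", "Active", "Draft"],
    "calc_date_id": pd.to_datetime(
        ["2022-07-05", "2021-08-31", "2021-07-01",
         "2022-11-20", "2022-04-02", "2022-06-23"]
    ),
})

# True on the first occurrence of each status value, anywhere in the frame.
is_first = ~df["status"].duplicated()
# Running count of boundaries seen so far = group label per row.
group_id = is_first.cumsum()
# Broadcast each group's minimum date back onto its rows.
df["desired_output"] = df.groupby(group_id)["calc_date_id"].transform("min")
print(df)
```

Here rows 0-1 form one group (the second Active row is not a first occurrence), rows 2-4 another, and the lone Draft row a third, so each row receives the minimum date of its group.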