Selecting the desired rows per group in a pandas dataframe based on a column value change

I have a source dataframe, df, like the one below.

order_id	prod_id	date	authorized
111	P	2022-01-01 00:00:00	N
111	P	2022-01-02 00:00:00	N
111	P	2022-01-03 00:00:00	Y
111	Q	2022-01-02 00:00:00	Y
112	P	2022-01-01 00:00:00	N
112	P	2022-01-02 00:00:00	Y
112	Q	2022-01-01 00:00:00	N
112	Q	2022-01-02 00:00:00	N
112	Q	2022-01-03 00:00:00	N
112	Q	2022-01-04 00:00:00	N
113	P	2022-01-01 00:00:00	N
113	P	2022-01-02 00:00:00	Y
113	P	2022-01-03 00:00:00	Y
113	P	2022-01-04 00:00:00	Y
113	P	2022-01-05 00:00:00	Y
114	Z	2022-01-05 00:00:00	Y

My goal is to create a target dataframe from it.

Here is the logic to be implemented for target dataframe creation:

a) For each order_id and prod_id combination, take the first row (sorted by date ascending) where authorized = 'Y'.

b) For an order_id and prod_id combination where none of the records have authorized = 'Y', take the first row, sorted by date ascending.

Is there any way to implement this efficiently in pandas? I did some googling but couldn't find a solution. Thanks in advance.

CodePudding user response:

Sort your dataframe by the authorized column (descending, so that 'Y' rows come before 'N' rows) and then by the date column (ascending), then group by the order_id and prod_id columns. Finally, take the first row of each group.

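import pandas as pd

# Convert the date column to datetime64
df['date'] = pd.to_datetime(df['date'], dayfirst=False)

# 'Y' sorts after 'N', so authorized descending puts 'Y' rows first;
# within each authorized value, the earliest date comes first.
out = df.sort_values(['authorized', 'date'], ascending=[False, True]) \
        .groupby(['order_id', 'prod_id']).first().reset_index()
print(out)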

# Output
   order_id prod_id       date authorized
0       111       P 2022-01-03          Y
1       111       Q 2022-01-02          Y
2       112       P 2022-01-02          Y
3       112       Q 2022-01-01          N
4       113       P 2022-01-02          Y
5       114       Z 2022-01-05          Y
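
As a self-contained variant of the same sort-then-take-first idea, here is a minimal sketch that rebuilds the sample data from the question and does two things differently. It sorts on an explicit boolean flag (the helper column is_auth is a name made up for this example) instead of relying on the string 'Y' sorting after 'N', and it uses drop_duplicates to keep whole rows. That second point only matters if your real data contains missing values: groupby(...).first() returns the first non-null value per column, so it could combine values from different rows. Treat this as one possible approach, not the only one.

```python
import pandas as pd

# Sample data copied from the question.
rows = [
    (111, "P", "2022-01-01", "N"),
    (111, "P", "2022-01-02", "N"),
    (111, "P", "2022-01-03", "Y"),
    (111, "Q", "2022-01-02", "Y"),
    (112, "P", "2022-01-01", "N"),
    (112, "P", "2022-01-02", "Y"),
    (112, "Q", "2022-01-01", "N"),
    (112, "Q", "2022-01-02", "N"),
    (112, "Q", "2022-01-03", "N"),
    (112, "Q", "2022-01-04", "N"),
    (113, "P", "2022-01-01", "N"),
    (113, "P", "2022-01-02", "Y"),
    (113, "P", "2022-01-03", "Y"),
    (113, "P", "2022-01-04", "Y"),
    (113, "P", "2022-01-05", "Y"),
    (114, "Z", "2022-01-05", "Y"),
]
df = pd.DataFrame(rows, columns=["order_id", "prod_id", "date", "authorized"])
df["date"] = pd.to_datetime(df["date"])

keys = ["order_id", "prod_id"]

result = (
    # is_auth is a hypothetical helper column: True where authorized == 'Y'.
    df.assign(is_auth=df["authorized"].eq("Y"))
      # Within each key pair: authorized rows first, then earliest date first.
      .sort_values(keys + ["is_auth", "date"],
                   ascending=[True, True, False, True])
      # Keep the first full row per (order_id, prod_id) pair.
      .drop_duplicates(subset=keys, keep="first")
      .drop(columns="is_auth")
      .reset_index(drop=True)
)
print(result)
```

On the sample data this selects the same six rows as the groupby version above. Rule b) is covered automatically: when a group has no 'Y' rows, all of its rows share is_auth = False, so the earliest date ends up first.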