Selecting the desired rows group by in pandas dataframe based on column value change

Time:01-05

I have a source dataframe, df, like the one below.

order_id prod_id date authorized
111 P 2022-01-01 00:00:00 N
111 P 2022-01-02 00:00:00 N
111 P 2022-01-03 00:00:00 Y
111 Q 2022-01-02 00:00:00 Y
112 P 2022-01-01 00:00:00 N
112 P 2022-01-02 00:00:00 Y
112 Q 2022-01-01 00:00:00 N
112 Q 2022-01-02 00:00:00 N
112 Q 2022-01-03 00:00:00 N
112 Q 2022-01-04 00:00:00 N
113 P 2022-01-01 00:00:00 N
113 P 2022-01-02 00:00:00 Y
113 P 2022-01-03 00:00:00 Y
113 P 2022-01-04 00:00:00 Y
113 P 2022-01-05 00:00:00 Y
114 Z 2022-01-05 00:00:00 Y

My goal is to create a target dataframe with one row per order_id and prod_id combination.

Here is the logic to be implemented for target dataframe creation:

a) For each order_id and prod_id combination, take the first row (sorted by date ascending) where authorized = 'Y'.

b) For an order_id and prod_id combination where none of the records have authorized = 'Y', take the first row, sorted by date ascending.

Is there an efficient way to implement this in pandas? I did some googling but couldn't find a solution. Thanks in advance.

Answer:

Sort your dataframe by the authorized column (descending, so that 'Y' comes before 'N') and the date column (ascending), then group by the order_id and prod_id columns. Finally, take the first row of each group.

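import pandas as pd

# Convert the date column to datetime64 so it sorts chronologically
df['date'] = pd.to_datetime(df['date'], dayfirst=False)

# 'Y' sorts after 'N', so descending order on authorized puts 'Y' rows first
out = df.sort_values(['authorized', 'date'], ascending=[False, True]) \
        .groupby(['order_id', 'prod_id']).first().reset_index()
print(out)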

# Output
   order_id prod_id       date authorized
0       111       P 2022-01-03          Y
1       111       Q 2022-01-02          Y
2       112       P 2022-01-02          Y
3       112       Q 2022-01-01          N
4       113       P 2022-01-02          Y
5       114       Z 2022-01-05          Y
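
For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same sort-then-take-first idea. It swaps groupby(...).first() for drop_duplicates(..., keep="first"), which is a variation and not the code from the answer: GroupBy.first() returns the first non-null value of each column, so if the real data contained missing values it could combine values from different rows, whereas drop_duplicates keeps whole rows. The sketch assumes authorized only ever holds 'Y' or 'N'; the names rows, ordered and result are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
rows = [
    (111, "P", "2022-01-01", "N"),
    (111, "P", "2022-01-02", "N"),
    (111, "P", "2022-01-03", "Y"),
    (111, "Q", "2022-01-02", "Y"),
    (112, "P", "2022-01-01", "N"),
    (112, "P", "2022-01-02", "Y"),
    (112, "Q", "2022-01-01", "N"),
    (112, "Q", "2022-01-02", "N"),
    (112, "Q", "2022-01-03", "N"),
    (112, "Q", "2022-01-04", "N"),
    (113, "P", "2022-01-01", "N"),
    (113, "P", "2022-01-02", "Y"),
    (113, "P", "2022-01-03", "Y"),
    (113, "P", "2022-01-04", "Y"),
    (113, "P", "2022-01-05", "Y"),
    (114, "Z", "2022-01-05", "Y"),
]
df = pd.DataFrame(rows, columns=["order_id", "prod_id", "date", "authorized"])
df["date"] = pd.to_datetime(df["date"])

# Assumption: authorized is only ever 'Y' or 'N'. 'Y' sorts after 'N', so
# descending order puts authorized rows first; dates ascend within each value.
ordered = df.sort_values(["authorized", "date"], ascending=[False, True])

# Keep the first whole row seen for each (order_id, prod_id) pair,
# then restore a readable order and a clean index.
result = (
    ordered.drop_duplicates(subset=["order_id", "prod_id"], keep="first")
    .sort_values(["order_id", "prod_id"])
    .reset_index(drop=True)
)
print(result)
```

On the sample data this selects the same six rows as the output shown above.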