dedup records(window function pandas)-CodePudding

Hi I am looking to dedup my records ordered by cancel date so I will only be interested in the most recent record.

sample data

id	cancel_date	type_of_fruit
1	2021-03-02	apple
1	2021-01-01	apple
2	2021-02-01	orange

expected output

id	cancel_date	type_of_fruit
1	2021-03-02	apple
2	2021-02-01	orange

I wrote the SQL way but I have to implement this logic in pandas, please help

SELECT 
    * 
FROM 
    (SELECT *, 
            rank() over(partition by id order by cancel_date desc) as rank 
     FROM df 
     ORDER BY id, cancel_date DESC) a 
where rank = 1

CodePudding user response：

Here is how you can achieve this.

Below code will convert cancel_date column into datetime object, because you want to order it using cancel_date:

#--if cancel_date is a string, then this code will convert to datetime--

import pandas as pd
df['cancel_date']= pd.to_datetime(df['cancel_date'])

Next grouping the table on id (this is similar to partition in SQL), then using cancel_date column to be sorted in descending order. Below code will achieve the same:

df["Rank"] = df.groupby("id")["cancel_date"].rank(method="first", ascending= False)

Finally, filtering the data with rank as 1:

filtered_df = df[df["Rank"] == 1]
filtered_df.head()