Home > Enterprise >  pandas groupby and do categorical ordering to drop duplicates
pandas groupby and do categorical ordering to drop duplicates

Time:01-03

I have a dataframe like as below

df = pd.DataFrame({
        "Name": ["Tim", "Tim", "Tim", "Tim", "Tim",'Jack','Jack','Jack'],
        "Status": ["A1", "E1", "B3", "D4", "C90","A1","C90","B3"]
})

The actual order of my status variable is B3 < A1 < D4 < C90 < E1.

So the last value is E1 and 1st value is B3.

I would like to do the below

a) groupby Name

a) order the values based on categorical ordering shown above

c) Keep only the last value (after dropping duplicates based on Name column)

So, I tried the below

df["Status"] = df["Status"].astype("category")
df["Status"] = df["Status"].cat.set_categories(["B3", "A1", "D4", "C90", "E90"], ordered=True)
df = df.sort_values(['Status'])
df_cleaned = df.drop_duplicates(['Status'],keep='last')

But this results in incorrect output.

I expect my output to be like as below (one row for each Name and their latest/last Status value)

Name   Status
Tim     E1
Jack    C90

CodePudding user response:

Add existing categories to list and sorting with remove duplicates by Name column:

df["Status"] = pd.Categorical(df["Status"], 
                              categories=["B3", "A1", "D4", "C90", "E90","E1"], 
                              ordered=True)

df_cleaned = (df.sort_values(['Status'])
                .drop_duplicates(['Name'],keep='last')

print (df_cleaned)
   Name Status
6  Jack    C90
1   Tim     E1

If possible some values are not in list of categories also remove missing values:

df_cleaned = (df.dropna(subset=['Status'])
                .sort_values(['Status'])
                .drop_duplicates(['Name'],keep='last')

CodePudding user response:

You can abstract the category column creation with encode_categorical from pyjanitor, and either use drop_duplicates or groupby:

# pip install pyjanitor
import pandas as pd
import janitor
(df
.encode_categorical(Status=['B3', 'A1', 'D4','C90','E1'])
.sort_values(['Name','Status'])
# you can skip the lines below with drop_duplicates
# .drop_duplicates(subset='Name', keep='last')
.groupby('Name', as_index=False)
.Status
.last()
) 
   Name Status
0  Jack    C90
1   Tim     E1
  • Related