I have a dataframe like as below
df = pd.DataFrame({
"Name": ["Tim", "Tim", "Tim", "Tim", "Tim",'Jack','Jack','Jack'],
"Status": ["A1", "E1", "B3", "D4", "C90","A1","C90","B3"]
})
The actual order of my status variable is B3 < A1 < D4 < C90 < E1.
So the last value is E1 and 1st value is B3.
I would like to do the below
a) groupby Name
a) order the values based on categorical ordering shown above
c) Keep only the last value (after dropping duplicates based on Name
column)
So, I tried the below
df["Status"] = df["Status"].astype("category")
df["Status"] = df["Status"].cat.set_categories(["B3", "A1", "D4", "C90", "E90"], ordered=True)
df = df.sort_values(['Status'])
df_cleaned = df.drop_duplicates(['Status'],keep='last')
But this results in incorrect output.
I expect my output to be like as below (one row for each Name
and their latest/last Status
value)
Name Status
Tim E1
Jack C90
CodePudding user response:
Add existing categories to list and sorting with remove duplicates by Name
column:
df["Status"] = pd.Categorical(df["Status"],
categories=["B3", "A1", "D4", "C90", "E90","E1"],
ordered=True)
df_cleaned = (df.sort_values(['Status'])
.drop_duplicates(['Name'],keep='last')
print (df_cleaned)
Name Status
6 Jack C90
1 Tim E1
If possible some values are not in list of categories also remove missing values:
df_cleaned = (df.dropna(subset=['Status'])
.sort_values(['Status'])
.drop_duplicates(['Name'],keep='last')
CodePudding user response:
You can abstract the category column creation with encode_categorical from pyjanitor, and either use drop_duplicates
or groupby
:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.encode_categorical(Status=['B3', 'A1', 'D4','C90','E1'])
.sort_values(['Name','Status'])
# you can skip the lines below with drop_duplicates
# .drop_duplicates(subset='Name', keep='last')
.groupby('Name', as_index=False)
.Status
.last()
)
Name Status
0 Jack C90
1 Tim E1