My data (df) looks like this:
Date | Name | Plan |
---|---|---|
2022 | John | College |
2022 | John | Work |
2021 | Kel | College |
2022 | James | Work |
2019 | Daron | College |
2019 | JQ | NaN |
2020 | Mel | College |
2017 | Shama | Work |
2021 | John | Nan |
2020 | John | Work |
2021 | Mel | Work |
2018 | Shama | Work |
My end result needs one plan (the most recent one), per one name.
Currently I: Drop all Plan NaN values, then sort by service date, and drop all but the most recent date using this code:
df = df.dropna(subset=['Plan'])
df = df.sort_values('Date').drop_duplicates('Name', keep='last')
This mostly works, but I need 'College' to take precedence over 'Work' when the two are put together on the same date. In the data above, this row: | 2022 | John |Work | would be the one kept from dropping duplicates and not the one with 'College'.
Everything works, except this little part where the dates are duplicated AND there are two differing plans.
In a non pandas setting I would think this:
if service dates are duplicated AND one == college AND other == anything else: then keep the one with college
The end result I need:
Date | Name | Plan |
---|---|---|
2022 | John | College |
2021 | Kel | College |
2022 | James | Work |
2019 | Daron | College |
2019 | JQ | NaN |
2021 | Mel | Work |
2018 | Shama | Work |
Let me know if that makes sense, Thank you!
CodePudding user response:
You can use a custom sort for "plan", where you give priority to "College" over "Work". Here taking advantage of an ordered Categorical
, but you could also go with a mapping from a dictionary:
(df
.assign(cat=pd.Categorical(df['Plan'], categories=['Work', 'College'],
ordered=True))
.sort_values(by=['Date', 'cat'], na_position='first')
.drop(columns='cat')
.groupby('Name', as_index=False).last()
)
output:
Name Date Plan
0 Daron 2019 College
1 JQ 2019 None
2 James 2022 Work
3 John 2022 College
4 Kel 2021 College
5 Mel 2021 Work
6 Shama 2018 Work
CodePudding user response:
Let us sort the values then drop the duplicates in Name
df['Plan'] = pd.Categorical(df['Plan'], ['College', 'Work'], True)
df.sort_values(['Date', 'Plan'], ascending=[0, 1]).drop_duplicates('Name')
Date Name Plan
0 2022 John College
3 2022 James Work
2 2021 Kel College
10 2021 Mel Work
4 2019 Daron College
5 2019 JQ NaN
11 2018 Shama Work