Python Pandas: How to choose a certain option within duplicates-CodePudding

My data (df) looks like this:

Date	Name	Plan
2022	John	College
2022	John	Work
2021	Kel	College
2022	James	Work
2019	Daron	College
2019	JQ	NaN
2020	Mel	College
2017	Shama	Work
2021	John	Nan
2020	John	Work
2021	Mel	Work
2018	Shama	Work

My end result needs one plan (the most recent one), per one name.

Currently I: Drop all Plan NaN values, then sort by service date, and drop all but the most recent date using this code:

df = df.dropna(subset=['Plan'])
df = df.sort_values('Date').drop_duplicates('Name', keep='last')

This mostly works, but I need 'College' to take precedence over 'Work' when the two are put together on the same date. In the data above, this row: | 2022 | John |Work | would be the one kept from dropping duplicates and not the one with 'College'.

Everything works, except this little part where the dates are duplicated AND there are two differing plans.

In a non pandas setting I would think this:

if service dates are duplicated AND one == college AND other == anything else: then keep the one with college

The end result I need:

Date	Name	Plan
2022	John	College
2021	Kel	College
2022	James	Work
2019	Daron	College
2019	JQ	NaN
2021	Mel	Work
2018	Shama	Work

Let me know if that makes sense, Thank you!

CodePudding user response：

You can use a custom sort for "plan", where you give priority to "College" over "Work". Here taking advantage of an ordered Categorical, but you could also go with a mapping from a dictionary:

(df
 .assign(cat=pd.Categorical(df['Plan'], categories=['Work', 'College'],
                            ordered=True))
 .sort_values(by=['Date', 'cat'], na_position='first')
 .drop(columns='cat')
 .groupby('Name', as_index=False).last()
)

output:

    Name  Date     Plan
0  Daron  2019  College
1     JQ  2019     None
2  James  2022     Work
3   John  2022  College
4    Kel  2021  College
5    Mel  2021     Work
6  Shama  2018     Work

CodePudding user response：

Let us sort the values then drop the duplicates in Name

df['Plan'] = pd.Categorical(df['Plan'], ['College', 'Work'], True)
df.sort_values(['Date', 'Plan'], ascending=[0, 1]).drop_duplicates('Name')

    Date   Name     Plan
0   2022   John  College
3   2022  James     Work
2   2021    Kel  College
10  2021    Mel     Work
4   2019  Daron  College
5   2019     JQ      NaN
11  2018  Shama     Work