Home > Software engineering >  Python Pandas: How to choose a certain option within duplicates
Python Pandas: How to choose a certain option within duplicates

Time:07-21

My data (df) looks like this:

Date Name Plan
2022 John College
2022 John Work
2021 Kel College
2022 James Work
2019 Daron College
2019 JQ NaN
2020 Mel College
2017 Shama Work
2021 John Nan
2020 John Work
2021 Mel Work
2018 Shama Work

My end result needs one plan (the most recent one), per one name.

Currently I: Drop all Plan NaN values, then sort by service date, and drop all but the most recent date using this code:

df = df.dropna(subset=['Plan'])
df = df.sort_values('Date').drop_duplicates('Name', keep='last')

This mostly works, but I need 'College' to take precedence over 'Work' when the two are put together on the same date. In the data above, this row: | 2022 | John |Work | would be the one kept from dropping duplicates and not the one with 'College'.

Everything works, except this little part where the dates are duplicated AND there are two differing plans.

In a non pandas setting I would think this:

if service dates are duplicated AND one == college AND other == anything else: then keep the one with college

The end result I need:

Date Name Plan
2022 John College
2021 Kel College
2022 James Work
2019 Daron College
2019 JQ NaN
2021 Mel Work
2018 Shama Work

Let me know if that makes sense, Thank you!

CodePudding user response:

You can use a custom sort for "plan", where you give priority to "College" over "Work". Here taking advantage of an ordered Categorical, but you could also go with a mapping from a dictionary:

(df
 .assign(cat=pd.Categorical(df['Plan'], categories=['Work', 'College'],
                            ordered=True))
 .sort_values(by=['Date', 'cat'], na_position='first')
 .drop(columns='cat')
 .groupby('Name', as_index=False).last()
)

output:

    Name  Date     Plan
0  Daron  2019  College
1     JQ  2019     None
2  James  2022     Work
3   John  2022  College
4    Kel  2021  College
5    Mel  2021     Work
6  Shama  2018     Work

CodePudding user response:

Let us sort the values then drop the duplicates in Name

df['Plan'] = pd.Categorical(df['Plan'], ['College', 'Work'], True)
df.sort_values(['Date', 'Plan'], ascending=[0, 1]).drop_duplicates('Name')

    Date   Name     Plan
0   2022   John  College
3   2022  James     Work
2   2021    Kel  College
10  2021    Mel     Work
4   2019  Daron  College
5   2019     JQ      NaN
11  2018  Shama     Work
  • Related