How to ignore the duplicate records based on multiple conditions in a dataframe?

I have a dataframe with duplicate employee codes, and I want to keep only one record per employee code.

df1:
Emp_code|Education Level|StartEducation|EndEducation

001 |Diploma      |Jun-1995|Jun 1999
001 |Professional |May-2002|May 2006
002 |PostGraduate | -      | -
002 |Diploma&Cert | -      | -
003 |PostGraduate |Jun-2008| -
003 |Diploma      |Aug-2005| -
004 |Graduate/Equi|-       | Mar-2012
004 |Professional |-       | Aug-2014

Expected Output:

Emp_code|Education Level|StartEducation|EndEducation

001 |Professional |May-2002|May-2006
002 |PostGraduate |-       |-
003 |PostGraduate |Jun-2008|-
004 |Professional |-       |Aug-2014

Filtering order:

For each Emp_code, the row with the maximum EndEducation date should be selected first.

If EndEducation is '-', select the row with the maximum StartEducation date.

If StartEducation is also '-', the selection should be based on the Education Level column.

Education level filter priority order:

Professional
PostGraduate
Diploma&Cert
Diploma
Graduation/Equi

If there are no start or end education dates, select the row where Education Level is "Professional"; otherwise select PostGraduate, then Diploma&Cert, and so on.

The idea is to have only the highest education details in the dataframe.

I am able to apply the individual filters, like the below:

df1.groupby('Emp_code')['EndEducation'].max()

df1.groupby('Emp_code')['StartEducation'].max()

df1[df1['Education Level'] == 'Professional']
df1[df1['Education Level'] == 'PostGraduate']
df1[df1['Education Level'] == 'Diploma&Cert']
df1[df1['Education Level'] == 'Diploma']
df1[df1['Education Level'] == 'Graduation/Equi']

But I am unable to apply all of these filters in one shot.

Any help would be appreciated.

CodePudding user response：

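You can use an ordered Categorical to rank the education levels, then keep the highest-ranked row per employee:

import pandas as pd

# Priority order, highest first
order = ['Professional', 'PostGraduate', 'Diploma&Cert', 'Diploma', 'Graduation/Equi']

# Reverse the list so the highest priority is the largest category, then take idxmax per Emp_code
out = df.loc[(df.assign(Education=pd.Categorical(df['Education Level'], categories=order[::-1], ordered=True))
               .groupby('Emp_code')['Education'].idxmax()
             )]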

Output:

   Emp_code Education Level StartEducation EndEducation
1         1    Professional       May-2002     May 2006
2         2    PostGraduate              -            -
4         3    PostGraduate       Jun-2008            -
7         4    Professional              -     Aug-2014
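
Note that the Categorical approach ranks rows on Education Level alone. It matches the expected output for this sample, but it never looks at the dates. Below is a minimal sketch that also applies the date rules from the question. It assumes the dates are month-year strings such as Jun-1999 (or Jun 1999), that '-' means missing, and that a row with a date outranks a row without one. The helper name parse_month_year and the temporary _end, _start and _rank columns are made up for illustration, and it uses a plain integer rank rather than a Categorical.

```python
import pandas as pd

# Sample data from the question; '-' marks a missing date.
df1 = pd.DataFrame({
    "Emp_code": ["001", "001", "002", "002", "003", "003", "004", "004"],
    "Education Level": ["Diploma", "Professional", "PostGraduate", "Diploma&Cert",
                        "PostGraduate", "Diploma", "Graduate/Equi", "Professional"],
    "StartEducation": ["Jun-1995", "May-2002", "-", "-", "Jun-2008", "Aug-2005", "-", "-"],
    "EndEducation": ["Jun 1999", "May 2006", "-", "-", "-", "-", "Mar-2012", "Aug-2014"],
})

# Priority order, highest first (taken from the question).
order = ["Professional", "PostGraduate", "Diploma&Cert", "Diploma", "Graduation/Equi"]
rank = {level: i for i, level in enumerate(order)}  # 0 = highest priority


def parse_month_year(s):
    """Hypothetical helper: 'Jun-1999' or 'Jun 1999' -> Timestamp, anything else -> NaT."""
    cleaned = s.str.strip().str.replace(" ", "-", regex=False)
    return pd.to_datetime(cleaned, format="%b-%Y", errors="coerce")


ranked = df1.assign(
    _end=parse_month_year(df1["EndEducation"]),
    _start=parse_month_year(df1["StartEducation"]),
    _rank=df1["Education Level"].map(rank),  # NaN for levels not listed in `order`
)

# Within each employee: latest end date first, then latest start date,
# then best education rank. Missing values sort last in every key.
out = (
    ranked.sort_values(
        ["Emp_code", "_end", "_start", "_rank"],
        ascending=[True, False, False, True],
        na_position="last",
    )
    .drop_duplicates("Emp_code", keep="first")
    .drop(columns=["_end", "_start", "_rank"])
)

print(out)
```

Levels that are not in order (for example the Graduate/Equi spelling in the sample data) get no rank and therefore sort last within their group when the dates tie.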
