I have a dataframe with duplicate employee codes. I want to keep only one record per employee code.
df1:
Emp_code|Education Level|StartEducation|EndEducation
001 |Diploma |Jun-1995|Jun 1999
001 |Professional |May-2002|May 2006
002 |PostGraduate | - | -
002 |Diploma&Cert | - | -
003 |PostGraduate |Jun-2008| -
003 |Diploma |Aug-2005| -
004 |Graduate/Equi|- | Mar-2012
004 |Professional |- | Aug-2014
Expected Output:
Emp_code|Education Level|StartEducation|EndEducation
001 |Professional |May-2002|May-2006
002 |PostGraduate |- |-
003 |PostGraduate |Jun-2008|-
004 |Professional |- |Aug-2014
Filtering order:
For each Emp_code, the row with the latest EndEducation date should be selected first.
If the EndEducation date is '-', select the row with the latest StartEducation date.
If the StartEducation date is also '-', the selection should be based on the Education Level column.
Education level filter priority order:
- Professional
- PostGraduate
- Diploma&Cert
- Diploma
- Graduation/Equi
If there are no start or end education dates, select the row where the Education Level is "Professional"; otherwise select PostGraduate, then Diploma&Cert, and so on.
The idea is to have only the highest education details in the dataframe.
I am able to apply the individual filters, like the ones below:
df1.groupby('Emp_code')['EndEducation'].max()
df1.groupby('Emp_code')['StartEducation'].max()
df1[df1['Education Level'] == 'Professional']
df1[df1['Education Level'] == 'PostGraduate']
df1[df1['Education Level'] == 'Diploma&Cert']
df1[df1['Education Level'] == 'Diploma']
df1[df1['Education Level'] == 'Graduation/Equi']
But I am unable to apply all these filters in one go.
Any help would be appreciated.
CodePudding user response:
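You can use an ordered Categorical:
import pandas as pd
order = ['Professional', 'PostGraduate', 'Diploma&Cert', 'Diploma', 'Graduation/Equi']
# Reversing the list makes 'Professional' the largest category,
# so idxmax picks the highest-priority level for each Emp_code.
out = df.loc[(df.assign(Education=pd.Categorical(df['Education Level'], categories=order[::-1], ordered=True))
.groupby('Emp_code')['Education'].idxmax()
)]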
Output:
Emp_code Education Level StartEducation EndEducation
1 1 Professional May-2002 May 2006
2 2 PostGraduate - -
4 3 PostGraduate Jun-2008 -
7 4 Professional - Aug-2014
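The snippet above ranks rows by education level only. It happens to reproduce the expected output for this sample, but it does not look at the dates. A minimal sketch that also applies the date rules from the question could sort on three keys and keep the first row per employee. It rests on some assumptions that are not in the original post: dates are strings in the form 'Mon-YYYY' with '-' for missing values, a row with a date always beats a row without one, and 'Graduate/Equi' in the sample is spelled 'Graduation/Equi' to match the priority list. The names keys, rank and best are illustrative.

```python
import pandas as pd

# Sample data from the question, with the assumptions noted above
# (hyphenated dates, 'Graduation/Equi' spelling).
df1 = pd.DataFrame({
    'Emp_code': ['001', '001', '002', '002', '003', '003', '004', '004'],
    'Education Level': ['Diploma', 'Professional', 'PostGraduate', 'Diploma&Cert',
                        'PostGraduate', 'Diploma', 'Graduation/Equi', 'Professional'],
    'StartEducation': ['Jun-1995', 'May-2002', '-', '-', 'Jun-2008', 'Aug-2005', '-', '-'],
    'EndEducation': ['Jun-1999', 'May-2006', '-', '-', '-', '-', 'Mar-2012', 'Aug-2014'],
})

order = ['Professional', 'PostGraduate', 'Diploma&Cert', 'Diploma', 'Graduation/Equi']
rank = {level: i for i, level in enumerate(order)}  # 0 = highest priority

# Build sortable keys; '-' cannot be parsed and becomes NaT.
keys = pd.DataFrame({
    'Emp_code': df1['Emp_code'],
    'end': pd.to_datetime(df1['EndEducation'].str.strip(), format='%b-%Y', errors='coerce'),
    'start': pd.to_datetime(df1['StartEducation'].str.strip(), format='%b-%Y', errors='coerce'),
    'rank': df1['Education Level'].str.strip().map(rank),
})

# Latest end date first, then latest start date, then best education rank.
# Missing dates sort last, so they only matter when nothing better exists.
best = (keys.sort_values(['Emp_code', 'end', 'start', 'rank'],
                         ascending=[True, False, False, True],
                         na_position='last')
            .drop_duplicates('Emp_code')
            .index)

out = df1.loc[best].sort_index()
print(out)
```

On the sample data this keeps the rows with index 1, 2, 4 and 7, the same rows as the expected output. Parsing the dates first matters because comparing strings such as 'Jun-2008' and 'Aug-2005' would order them alphabetically rather than chronologically.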