I have a dataframe with an ID column, and 2 other columns named "Legal Employer" and "Legacy Legal Employer" like this:
ID  Legal Employer  Legacy Legal Employer
1   Warehouse       Warehouse 4
1   Pool            Warehouse 4
1   Drive           Warehouse 4
2   Warehouse
2   Drive           Warehouse
3   Warehouse       Drive
3   Pool
4   Drive           Drive 4
4   Warehouse       Drive 4
For each unique ID, I want to keep only 1 row, considering the following:
- If for that ID there is a blank (NaN) in the "Legacy Legal Employer" column, keep that row and remove the others
- If for that ID there is no blank (NaN) in the "Legacy Legal Employer" column, keep the first record and remove the others
This would result in this dataframe:
ID  Legal Employer  Legacy Legal Employer
1   Warehouse       Warehouse 4
2   Warehouse
3   Pool
4   Drive           Drive 4
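For reference, the sample dataframe could be built like this (a minimal sketch, with the blanks stored as None/NaN):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3, 3, 4, 4],
    'Legal Employer': ['Warehouse', 'Pool', 'Drive', 'Warehouse', 'Drive',
                       'Warehouse', 'Pool', 'Drive', 'Warehouse'],
    'Legacy Legal Employer': ['Warehouse 4', 'Warehouse 4', 'Warehouse 4', None, 'Warehouse',
                              'Drive', None, 'Drive 4', 'Drive 4'],
})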
What is the best way to achieve this?
Thank you!
CodePudding user response:
Solution
Create a boolean mask that flags the null values in Legacy Legal Employer, group this mask by ID, and use idxmax to get the index of the max value (i.e. the first True) per group; then use these indices to filter the rows. PS: if a group has no null value at all, idxmax automatically selects its first row.
df.loc[df['Legacy Legal Employer'].isna().groupby(df['ID']).idxmax()]
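To see what the mask and the grouped idxmax produce, here is a small sketch on the sample frame above (assuming the default 0..8 integer index):

mask = df['Legacy Legal Employer'].isna()            # True where the legacy employer is blank
first_blank_or_first = mask.groupby(df['ID']).idxmax()
# ID
# 1    0    <- no blank for ID 1, so idxmax falls back to the first row
# 2    3
# 3    6
# 4    7
result = df.loc[first_blank_or_first]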
Alternative solution
Sort the dataframe so that rows with a null Legacy Legal Employer come first, then drop the duplicates by ID (keeping the first occurrence per ID), and optionally sort the result by ID again.
import numpy as np
m = df['Legacy Legal Employer'].isna()
# stable sort puts null rows first while keeping the original row order within ties
df.iloc[np.argsort(~m, kind='stable')].drop_duplicates('ID').sort_values('ID')
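If a NumPy-free version is preferred, the same "nulls first, then drop duplicates" idea can be written with sort_values and a key function (a sketch, assuming pandas >= 1.1 for the key argument):

(df.sort_values('Legacy Legal Employer',
                key=lambda s: s.notna(),   # False (blank) sorts before True
                kind='stable')             # keep the original order within each ID
   .drop_duplicates('ID')
   .sort_values('ID'))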
Result
   ID Legal Employer Legacy Legal Employer
0   1      Warehouse           Warehouse 4
3   2      Warehouse                  None
6   3           Pool                  None
7   4          Drive               Drive 4
CodePudding user response:
Another possible solution:
(df.groupby('ID', group_keys=False)
   .apply(lambda g: g.loc[g['Legacy Legal Employer'].isna()]  # keep the blank row(s) if any
          if g['Legacy Legal Employer'].isna().any()
          else g.head(1)))                                     # otherwise keep the first row
Output:
   ID Legal Employer Legacy Legal Employer
0   1      Warehouse           Warehouse 4
3   2      Warehouse                   NaN
6   3           Pool                   NaN
7   4          Drive               Drive 4
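Note that if some ID ever had more than one blank in Legacy Legal Employer, the boolean .loc branch above would keep all of those rows. One possible tweak to keep only the first blank row per ID is to add .head(1) to that branch as well (a sketch, not needed for the sample data):

(df.groupby('ID', group_keys=False)
   .apply(lambda g: g.loc[g['Legacy Legal Employer'].isna()].head(1)
          if g['Legacy Legal Employer'].isna().any()
          else g.head(1)))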