I have a dataframe with an ID column, and 2 other columns named "Legal Employer" and "Legacy Legal Employer" like this:
ID  Legal Employer  Legacy Legal Employer
1   Warehouse       Warehouse 4
1   Pool            Warehouse 4
1   Drive           Warehouse 4
2   Warehouse
2   Drive           Warehouse
3   Warehouse       Drive
3   Pool
4   Drive           Drive 4
4   Warehouse       Drive 4
For each unique ID, I want to keep only 1 row, considering the following:
- If for that ID there is a blank (NaN) in the "Legacy Legal Employer" column, keep that row and remove the others
- If for that ID there is no blank (NaN) in the "Legacy Legal Employer" column, keep the first record and remove the others
This would result in this dataframe:
ID  Legal Employer  Legacy Legal Employer
1   Warehouse       Warehouse 4
2   Warehouse
3   Pool
4   Drive           Drive 4
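For reference, the sample dataframe could be built like this (a minimal sketch, with the blanks stored as None/NaN):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3, 3, 4, 4],
    'Legal Employer': ['Warehouse', 'Pool', 'Drive', 'Warehouse', 'Drive',
                       'Warehouse', 'Pool', 'Drive', 'Warehouse'],
    'Legacy Legal Employer': ['Warehouse 4', 'Warehouse 4', 'Warehouse 4', None, 'Warehouse',
                              'Drive', None, 'Drive 4', 'Drive 4'],
})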
What is the best way to achieve this?
Thank you!
CodePudding user response:
Solution
Create a boolean mask that flags the null values in Legacy Legal Employer, group this mask by ID, and use idxmax to get the index of the max value (i.e. the first True) per group; then use these indices to filter the rows. PS: if a group has no null value at all, idxmax automatically selects its first row.
df.loc[df['Legacy Legal Employer'].isna().groupby(df['ID']).idxmax()]
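To see what the mask and the grouped idxmax produce, here is a small sketch on the sample frame above (assuming the default 0..8 integer index):

mask = df['Legacy Legal Employer'].isna()            # True where the legacy employer is blank
first_blank_or_first = mask.groupby(df['ID']).idxmax()
# ID
# 1    0    <- no blank for ID 1, so idxmax falls back to the first row
# 2    3
# 3    6
# 4    7
result = df.loc[first_blank_or_first]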
Alternative solution
Sort the dataframe so that rows with a null Legacy Legal Employer come first, then drop the duplicates by ID (keeping the first occurrence per ID), and optionally sort the result by ID again.
import numpy as np
m = df['Legacy Legal Employer'].isna()
# stable sort puts null rows first while keeping the original row order within ties
df.iloc[np.argsort(~m, kind='stable')].drop_duplicates('ID').sort_values('ID')
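If a NumPy-free version is preferred, the same "nulls first, then drop duplicates" idea can be written with sort_values and a key function (a sketch, assuming pandas >= 1.1 for the key argument):

(df.sort_values('Legacy Legal Employer',
                key=lambda s: s.notna(),   # False (blank) sorts before True
                kind='stable')             # keep the original order within each ID
   .drop_duplicates('ID')
   .sort_values('ID'))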
Result
   ID Legal Employer Legacy Legal Employer
0   1      Warehouse           Warehouse 4
3   2      Warehouse                  None
6   3           Pool                  None
7   4          Drive               Drive 4
CodePudding user response:
Another possible solution:
(df.groupby('ID', group_keys=False)
   .apply(lambda g: g.loc[g['Legacy Legal Employer'].isna()]  # keep the blank row(s) if any
          if g['Legacy Legal Employer'].isna().any()
          else g.head(1)))                                     # otherwise keep the first row
Output:
   ID Legal Employer Legacy Legal Employer
0   1      Warehouse           Warehouse 4
3   2      Warehouse                   NaN
6   3           Pool                   NaN
7   4          Drive               Drive 4
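Note that if some ID ever had more than one blank in Legacy Legal Employer, the boolean .loc branch above would keep all of those rows. One possible tweak to keep only the first blank row per ID is to add .head(1) to that branch as well (a sketch, not needed for the sample data):

(df.groupby('ID', group_keys=False)
   .apply(lambda g: g.loc[g['Legacy Legal Employer'].isna()].head(1)
          if g['Legacy Legal Employer'].isna().any()
          else g.head(1)))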