I have the following dataframe.
ID path1 path2 path3
1 12 NaN NaN
1 1 5 NaN
1 2 NaN ''
1 2 4 111
2 123 NaN NaN
3 11 25 NaN
3 1 NaN NaN
3 21 34 NaN
3 NaN NaN NaN
I want to update column values based on ID if there are NaN values. first priority is path1, path2 and path3 have values then keep. check which column has more results keep that column.
What is the best way to get the result of the dataframe below?
ID path1 path2 path3
1 2 4 111
2 123 NaN NaN
3 11 25 NaN
3 21 34 NaN
CodePudding user response:
Group the dataframe by ID
then get all the records which have maximum number of non NaN
values in the rows by applying a function to the group.
>>> (df.groupby('ID')
.apply(lambda x: x.loc[x.notna().sum(axis=1).max() == x.notna().sum(axis=1)])
.reset_index(drop=True))
ID path1 path2 path3
0 1 2.0 4.0 111
1 2 123.0 NaN NaN
2 3 11.0 25.0 NaN
3 3 21.0 34.0 NaN
Here is slightly modified version as above code creates mask and calls sum twice, it can be avoided by use of normal function than a lambda function:
def get_rows(df):
counts = df.notna().sum(axis=1)
return df.loc[counts==counts.max()]
df.groupby('ID').apply(get_rows).reset_index(drop=True)
ID path1 path2 path3
0 1 2.0 4.0 111
1 2 123.0 NaN NaN
2 3 11.0 25.0 NaN
3 3 21.0 34.0 NaN