Morning!
I want to find the maximum value for each row of my dataset.
Example:
ID | Age | Col_A | Col_B | Col_C |
---|---|---|---|---|
1 | 20 | 1 | 5 | 3 |
2 | NaN | 6 | 8 | 9 |
3 | 25 | 5 | 6 | 7 |
4 | 30 | 3 | 4 | 6 |
5 | NaN | 6 | 2 | 1 |
6 | 27 | 1 | 8 | 4 |
Then, for ID #2 knowing that the max value from the subset [Col_A; Col_B; Col_C] is 9 from Col_C I want to input the missing value as the mean Age of the ID's which also had the max value in Col_C.
In this case, the Age of ID #2 will be the average from Age ID#3 and ID#4 which will be 28.
CodePudding user response:
You can create a temporary column which shows which is the max column for each ID using idxmax
and perform it column-wise, (axis=1)
, using only the Col_
columns.
Then impute the missing Age with a grouped average on the new column, using fillna
and groupby.transform
:
df['max_col'] = df.filter(like='Col_').idxmax(axis=1)
df['Age_filled'] = round(df['Age'].fillna(df.groupby('max_col')['Age'].transform('mean')))
Prints:
ID Age Col_A Col_B Col_C max_col
0 1 20.0 1 5 3 Col_B
1 2 28.0 6 8 9 Col_C
2 3 25.0 5 6 7 Col_C
3 4 30.0 3 4 6 Col_C
4 5 NaN 6 2 1 Col_A
5 6 27.0 1 8 4 Col_B
For ID = 5
, there is no other ID which has the maximum value in Col_A. So for such occasions, it is still left np.nan