Pandas: Find maximum value in row and retrieve it's column position-CodePudding

Morning!

I want to find the maximum value for each row of my dataset.

Example:

ID	Age	Col_A	Col_B	Col_C
1	20	1	5	3
2	NaN	6	8	9
3	25	5	6	7
4	30	3	4	6
5	NaN	6	2	1
6	27	1	8	4

Then, for ID #2 knowing that the max value from the subset [Col_A; Col_B; Col_C] is 9 from Col_C I want to input the missing value as the mean Age of the ID's which also had the max value in Col_C.

In this case, the Age of ID #2 will be the average from Age ID#3 and ID#4 which will be 28.

CodePudding user response：

You can create a temporary column which shows which is the max column for each ID using idxmax and perform it column-wise, (axis=1), using only the Col_ columns.

Then impute the missing Age with a grouped average on the new column, using fillna and groupby.transform:

df['max_col'] = df.filter(like='Col_').idxmax(axis=1)
df['Age_filled'] = round(df['Age'].fillna(df.groupby('max_col')['Age'].transform('mean')))

Prints:

  ID   Age  Col_A  Col_B  Col_C max_col
0   1  20.0      1      5      3   Col_B
1   2  28.0      6      8      9   Col_C
2   3  25.0      5      6      7   Col_C
3   4  30.0      3      4      6   Col_C
4   5   NaN      6      2      1   Col_A
5   6  27.0      1      8      4   Col_B

For ID = 5, there is no other ID which has the maximum value in Col_A. So for such occasions, it is still left np.nan