I have four dataframes with the following structure:
df1
   max_proba chosen_class
0       0.80      class_A
1       0.92      class_B
2       0.82      class_B
3       0.74      class_B
4       0.58      class_A
df2
   max_proba chosen_class
0       0.60      class_C
1       0.62      class_D
2       0.87      class_D
3       0.94      class_C
4       0.62      class_D
# ... df3 and df4 have the same structure; only the chosen_class values and the probabilities differ.
I want to compare the "max_proba" column across all four dataframes, row by row, and keep the maximum value together with its chosen class.
(For example, for one sample: if df1 max_proba = 0.23, df2 max_proba = 0.86, df3 max_proba = 0.56, and df4 max_proba = 0.76, I want to keep only the chosen class with the highest probability, 0.86, which could be class_E, for example.)
CodePudding user response:
If I understood you correctly, you want to compare them row by row.
First, stack them into one data frame:
df = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0; add df3 and df4 to the list the same way
Then create a column 'index' holding each row's number in its original data frame, and a column 'level_0' holding its row number in the combined data frame:
df = df.reset_index()
df = df.reset_index()
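Calling reset_index twice can look like a typo, so here is a minimal sketch of what each call does. The two small frames are shortened stand-ins for the question's data, and the name `stacked` is only illustrative. The first call moves the old row labels into a column named 'index'; because that name is then taken, the second call stores the new row labels in 'level_0'.

```python
import pandas as pd

# Shortened stand-ins for df1 and df2 from the question.
df1 = pd.DataFrame({"max_proba": [0.80, 0.92], "chosen_class": ["class_A", "class_B"]})
df2 = pd.DataFrame({"max_proba": [0.60, 0.62], "chosen_class": ["class_C", "class_D"]})

stacked = pd.concat([df1, df2])   # row labels are 0, 1, 0, 1
stacked = stacked.reset_index()   # old labels move into column 'index'
stacked = stacked.reset_index()   # 'index' is taken, so the new labels go into 'level_0'

print(list(stacked.columns))
print(stacked["index"].tolist())    # row number inside the original frame
print(stacked["level_0"].tolist())  # row number inside the stacked frame
```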
Next, for each value of 'index', flag the rows that hold the maximum max_proba:
indexes = df.groupby('index').apply(lambda x: x.max_proba == max(x['max_proba'])).reset_index()
Finally, use those flags to select the rows with the maximum max_proba from the combined data frame:
result = df.loc[indexes[indexes.max_proba].level_1.values]
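With df1 and df2 from the question, the output looks like this:
   level_0  index  max_proba chosen_class
0        0      0       0.80      class_A
1        1      1       0.92      class_B
7        7      2       0.87      class_D
8        8      3       0.94      class_C
9        9      4       0.62      class_D
You can remove the extra columns with the drop method, for example result.drop(columns=['level_0', 'index']).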
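As an alternative, the same row-by-row selection can probably be written more compactly with groupby and idxmax. This is only a sketch, not the approach above: the names `frames`, `stacked`, `best`, and the column name 'row' are my own, and only df1 and df2 are filled in because the question does not show the data for df3 and df4.

```python
import pandas as pd

df1 = pd.DataFrame({
    "max_proba": [0.80, 0.92, 0.82, 0.74, 0.58],
    "chosen_class": ["class_A", "class_B", "class_B", "class_B", "class_A"],
})
df2 = pd.DataFrame({
    "max_proba": [0.60, 0.62, 0.87, 0.94, 0.62],
    "chosen_class": ["class_C", "class_D", "class_D", "class_C", "class_D"],
})

frames = [df1, df2]  # assumption: append df3 and df4 here, they share the same columns

# Stack the frames and keep each row's original position in a 'row' column.
stacked = pd.concat(frames).rename_axis("row").reset_index()

# For every original row position, find the label of the row with the largest max_proba.
# On ties, idxmax keeps the first occurrence, i.e. the earliest frame in the list.
best = stacked.groupby("row")["max_proba"].idxmax()

# Pick those rows and use the original row position as the index again.
result = stacked.loc[best.to_numpy()].set_index("row")
print(result)
```

The result has one row per sample, holding only max_proba and chosen_class, so there are no helper columns left to drop.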