I have a DataFrame, containing results of an ensemble classification model. This gives me for each prediction a Label
and a Conf
.
Now I want to create the Label_Final
by the following logical:
- Most frequent value exists --> take this one
- Two most frequent values --> take the one with larger
Conf
sum - No most frequent value --> take the one with largest
Conf
- No most frequent value and more than one
Conf
with same value, just take one (random/first, doesn't matter)
Let assume the data lookes like:
---- ---------- --------- ---------- --------- ---------- --------- ---------- ---------
| | Label1 | Conf1 | Label2 | Conf2 | Label3 | Conf3 | Label4 | Conf4 |
|---- ---------- --------- ---------- --------- ---------- --------- ---------- ---------|
| 0 | Class1 | 0.9 | Class2 | 0.8 | Class2 | 0.8 | Class2 | 0.8 |
| 1 | Class2 | 0.8 | Class2 | 0.2 | Class2 | 0.5 | Class2 | 0.5 |
| 2 | Class1 | 0.7 | Class2 | 0.5 | Class3 | 0.9 | Class4 | 0.8 |
| 3 | Class4 | 0.6 | Class4 | 0.2 | Class3 | 0.4 | Class3 | 0.2 |
---- ---------- --------- ---------- --------- ---------- --------- ---------- ---------
The calculation would be:
Row[0]
, 3 xClass2
- --> take
Class2
- --> take
Row[1]
, 4 xClass2
- --> take
Class2
- --> take
Row[2]
, no most frequent one- highest
Conf
is0.9
- --> take
Class3
- highest
Row[3]
, 2 xClass3
& 2 xClass4
Class4
Conf
=0.6 0.2 = 0.8
Class3
Conf
=0.4 0.2 = 0.6
- --> take
Class4
Because I need to look at the Conf
to select the Label
I don't know how to solve it with pd.mode()
and pd.nunique()
- I don't even know if it's possible with this functions. Can somebody help?
MWE
import pandas as pd
df = pd.DataFrame({
"Label1":["Class1", "Class2", "Class3", "Class4"],
"Conf1":[0.9,0.8,0.7,0.6],
"Label2":["Class2", "Class2", "Class2", "Class4"],
"Conf2":[0.8,0.2,0.5,0.2],
"Label3":["Class2", "Class2", "Class3", "Class3"],
"Conf3":[0.8,0.5,0.9,0.4],
"Label4":["Class2", "Class2", "Class3", "Class3"],
"Conf4":[0.8,0.5,0.8,0.2]
})
df["Label_Final"] = ["Class2", "Class2", "Class3", "Class4"]
CodePudding user response:
You cound reshape to longer, then count the number of classes, and sum the confs, Finally sort by these two in a descending order then select the first one in each case.
(pd.wide_to_long(df.reset_index(), ['Label', 'Conf'], 'index', 'grp')
.reset_index().groupby(['index', 'Label'])
.agg(count=('Label', 'count'), Conf=('Conf', 'sum'))
.reset_index().sort_values(['count', 'Conf'], ascending = [False, False])
.groupby('index').head(1).rename(columns = {'Label':'Final_Label'})
.merge(df.reset_index(), on='index', how='right'))
index Final_Label count Conf Label1 ... Conf2 Label3 Conf3 Label4 Conf4
0 1 Class2 3 2.4 Class1 ... 0.8 Class2 0.8 Class2 0.8
1 2 Class2 4 2.0 Class2 ... 0.2 Class2 0.5 Class2 0.5
2 3 Class3 1 0.9 Class1 ... 0.5 Class3 0.9 Class4 0.8
3 4 Class4 2 0.8 Class4 ... 0.2 Class3 0.4 Class3 0.2