Home > OS >  Create label using most frequent value / weight
Create label using most frequent value / weight

Time:06-22

I have a DataFrame, containing results of an ensemble classification model. This gives me for each prediction a Label and a Conf.

Now I want to create the Label_Final by the following logical:

  • Most frequent value exists --> take this one
  • Two most frequent values --> take the one with larger Conf sum
  • No most frequent value --> take the one with largest Conf
  • No most frequent value and more than one Conf with same value, just take one (random/first, doesn't matter)

Let assume the data lookes like:

 ---- ---------- --------- ---------- --------- ---------- --------- ---------- --------- 
|    | Label1   |   Conf1 | Label2   |   Conf2 | Label3   |   Conf3 | Label4   |   Conf4 |
|---- ---------- --------- ---------- --------- ---------- --------- ---------- ---------|
|  0 | Class1   |     0.9 | Class2   |     0.8 | Class2   |     0.8 | Class2   |     0.8 |
|  1 | Class2   |     0.8 | Class2   |     0.2 | Class2   |     0.5 | Class2   |     0.5 |
|  2 | Class1   |     0.7 | Class2   |     0.5 | Class3   |     0.9 | Class4   |     0.8 |
|  3 | Class4   |     0.6 | Class4   |     0.2 | Class3   |     0.4 | Class3   |     0.2 |
 ---- ---------- --------- ---------- --------- ---------- --------- ---------- --------- 

The calculation would be:

  • Row[0], 3 x Class2
    • --> take Class2
  • Row[1], 4 x Class2
    • --> take Class2
  • Row[2], no most frequent one
    • highest Conf is 0.9
    • --> take Class3
  • Row[3], 2 x Class3 & 2 x Class4
    • Class4 Conf = 0.6 0.2 = 0.8
    • Class3 Conf = 0.4 0.2 = 0.6
    • --> take Class4

Because I need to look at the Conf to select the Label I don't know how to solve it with pd.mode() and pd.nunique() - I don't even know if it's possible with this functions. Can somebody help?


MWE

import pandas as pd


df = pd.DataFrame({
    "Label1":["Class1", "Class2", "Class3", "Class4"],
    "Conf1":[0.9,0.8,0.7,0.6],
    "Label2":["Class2", "Class2", "Class2", "Class4"],
    "Conf2":[0.8,0.2,0.5,0.2],
    "Label3":["Class2", "Class2", "Class3", "Class3"],
    "Conf3":[0.8,0.5,0.9,0.4],
    "Label4":["Class2", "Class2", "Class3", "Class3"],
    "Conf4":[0.8,0.5,0.8,0.2]
})
df["Label_Final"] = ["Class2", "Class2", "Class3", "Class4"]

CodePudding user response:

You cound reshape to longer, then count the number of classes, and sum the confs, Finally sort by these two in a descending order then select the first one in each case.

(pd.wide_to_long(df.reset_index(), ['Label', 'Conf'], 'index', 'grp')
   .reset_index().groupby(['index', 'Label'])
   .agg(count=('Label', 'count'), Conf=('Conf', 'sum'))
   .reset_index().sort_values(['count', 'Conf'], ascending = [False, False])
   .groupby('index').head(1).rename(columns = {'Label':'Final_Label'})
   .merge(df.reset_index(), on='index', how='right'))

   index Final_Label  count  Conf  Label1  ...  Conf2  Label3  Conf3  Label4  Conf4
0      1      Class2      3   2.4  Class1  ...    0.8  Class2    0.8  Class2    0.8
1      2      Class2      4   2.0  Class2  ...    0.2  Class2    0.5  Class2    0.5
2      3      Class3      1   0.9  Class1  ...    0.5  Class3    0.9  Class4    0.8
3      4      Class4      2   0.8  Class4  ...    0.2  Class3    0.4  Class3    0.2
  • Related