One hot encode columns of lists including additional confidence decimal number

Time:10-20

I have a table that I want to one-hot encode. I can do that with pandas get_dummies or sklearn MultiLabelBinarizer, as described in, e.g., this Stack Overflow post. The normal approach looks like this:

    categories              a   b   c   e
0   [a, b, c]           0   1   1   1   0
1   [c]         --->    1   0   0   1   0
2   [b, c, e]           2   0   1   1   1
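
For reference, the plain one-hot step shown above can be sketched with pandas alone. This is only a minimal illustration: the variable names `df` and `plain` and the sample data are assumptions that mirror the table, and it uses `Series.str.join` plus `Series.str.get_dummies` rather than MultiLabelBinarizer.

```python
import pandas as pd

# Sample data mirroring the table above (assumed: categories are plain strings).
df = pd.DataFrame({"categories": [["a", "b", "c"], ["c"], ["b", "c", "e"]]})

# Join each list into a "|"-separated string, then split it into 0/1 indicator columns.
plain = df["categories"].str.join("|").str.get_dummies(sep="|")
print(plain)
```

This only works when no category name contains the separator character; MultiLabelBinarizer does not have that restriction.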

However, in my case I also have confidences attached to my categories, like this:

    categories
0   [{a:0.3}, {b:0.4}, {c:0.5}]
1   [{c:0.8}]
2   [{b:1}, {c:1}, {e:0.1}]

I would like to incorporate that knowledge in my decision tree classifier, i.e. I would like to get my data in this format:

    a   b   c   e
0   0.3 0.4 0.5 0
1   0   0   0.8 0
2   0   1.0 1.0 0.1

I could first build the normal one hot encoded table and then change the values afterwards by going through all rows. However, I was hoping there would be an easier way.

How can I one hot encode the table above and incorporate the additional information of the category confidences?

CodePudding user response:

Use a dictionary comprehension to flatten each row's list of dictionaries into a single dictionary, then build a DataFrame from the result:

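import pandas as pd

# Merge each row's list of single-key dicts into one dict (category -> confidence),
# build a DataFrame from those dicts, and fill categories missing from a row with 0.
df = (pd.DataFrame([{k: v for d in x for k, v in d.items()} for x in df['categories']])
        .fillna(0))
print(df)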
     a    b    c    e
0  0.3  0.4  0.5  0.0
1  0.0  0.0  0.8  0.0
2  0.0  1.0  1.0  0.1
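
A self-contained version of the same idea, as a minimal sketch. The sample data copies the question, with string keys assumed because the question writes them unquoted; the helper name `encode_confidences` is hypothetical and not part of pandas.

```python
import pandas as pd

# Sample data mirroring the question (assumed: keys are strings).
df = pd.DataFrame({
    "categories": [
        [{"a": 0.3}, {"b": 0.4}, {"c": 0.5}],
        [{"c": 0.8}],
        [{"b": 1}, {"c": 1}, {"e": 0.1}],
    ]
})

def encode_confidences(series):
    """Merge each row's list of one-key dicts and expand the keys into columns."""
    rows = [{k: v for d in row for k, v in d.items()} for row in series]
    # Reuse the original index so the result lines up with the source rows.
    return pd.DataFrame(rows, index=series.index).fillna(0.0)

encoded = encode_confidences(df["categories"])
print(encoded)
```

If a category appears more than once in the same row, the dictionary comprehension keeps the last confidence it sees. Because the original index is preserved, `encoded` can be joined back onto the other columns of `df` if they are needed as features.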