I have a table that I want to one hot encode. I can do that using pandas get_dummies or sklearn MultiLabelBinarizer, as described in e.g. this Stack Overflow post. The normal approach looks like this:
  categories             a  b  c  e
0  [a, b, c]          0  1  1  1  0
1        [c]   --->   1  0  0  1  0
2  [b, c, e]          2  0  1  1  1
However, in my case I also have confidences attached to my categories, like this:
categories
0 [{a:0.3}, {b:0.4}, {c:0.5}]
1 [{c:0.8}]
2 [{b:1}, {c:1}, {e:0.1}]
I would like to incorporate that knowledge in my decision tree classifier, i.e. I would like to get my data in this format:
     a    b    c    e
0  0.3  0.4  0.5    0
1    0    0  0.8    0
2    0  1.0  1.0  0.1
I could first build the normal one hot encoded table and then change the values afterwards by going through all rows. However, I was hoping there would be an easier way.
How can I one hot encode the table above and incorporate the additional information of the category confidences?
CodePudding user response:
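Use a dictionary comprehension to flatten each row's list of dictionaries into a single dictionary, build a DataFrame from those dictionaries, and fill the missing values with 0: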
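import pandas as pd
# Merge each row's list of single-key dicts into one dict per row;
# categories absent from a row come out as NaN and are replaced by 0.
df = (pd.DataFrame([{k: v for d in x for k, v in d.items()} for x in df['categories']])
        .fillna(0))
print(df)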
     a    b    c    e
0  0.3  0.4  0.5  0.0
1  0.0  0.0  0.8  0.0
2  0.0  1.0  1.0  0.1
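For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The sample frame is rebuilt from the question. The helper name encode_confidences and the variable encoded are made up for illustration and do not come from pandas. It assumes every entry in the list is a dict mapping a category name to a confidence, and if a category appears twice in the same row, the later value wins.

```python
import pandas as pd

# Sample data from the question: each row holds a list of single-key dicts
# mapping a category name to its confidence.
df = pd.DataFrame({
    "categories": [
        [{"a": 0.3}, {"b": 0.4}, {"c": 0.5}],
        [{"c": 0.8}],
        [{"b": 1}, {"c": 1}, {"e": 0.1}],
    ]
})


def encode_confidences(series):
    """Hypothetical helper: turn a column of [{category: confidence}, ...]
    lists into one column per category, with 0 where a category is absent."""
    # Merge the dicts of each row into a single dict (later duplicates win).
    rows = [{k: v for d in row for k, v in d.items()} for row in series]
    # Keys missing from a row become NaN in the frame; replace them with 0.
    return pd.DataFrame(rows, index=series.index).fillna(0.0)


encoded = encode_confidences(df["categories"])
print(encoded)
```

This should print the same table as above. Passing index=series.index keeps the original row labels, so the result can be joined back onto the other columns of df before fitting the classifier.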