How to rank the categorical values while one-hot-encoding

Time:06-23

I have data like this:

id  feature_1  feature_2
1   a          e
2   b          c
3   c          d
4   d          b
5   e          a

I want a one-hot-style encoding in which the value from the first feature column is encoded as 1 and the value from the second feature column as 0.5, as in the following table.

id  a    b    c    d    e
1   1    0    0    0    0.5
2   0    1    0.5  0    0
3   0    0    1    0.5  0
4   0    0.5  0    1    0
5   0.5  0    0    0    1

But when I apply sklearn.preprocessing.OneHotEncoder, it outputs 10 columns (one per feature/value pair), each containing only 0s and 1s.

How can I achieve this?

CodePudding user response:

For the two columns, you can add the two crosstabs, scaling the second one by 0.5:

pd.crosstab(df['id'], df['feature_1']) + pd.crosstab(df['id'], df['feature_2']) * .5

Output:

feature_1    a    b    c    d    e
id                                
1          1.0  0.0  0.0  0.0  0.5
2          0.0  1.0  0.5  0.0  0.0
3          0.0  0.0  1.0  0.5  0.0
4          0.0  0.5  0.0  1.0  0.0
5          0.5  0.0  0.0  0.0  1.0
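
For reference, here is a self-contained sketch of the same idea that can be pasted and run as-is. It rebuilds the sample frame from the question and uses DataFrame.add with fill_value=0 instead of a plain +, on the assumption that a category might appear in only one of the two feature columns (with the sample data both approaches give the same result). The names first, second and encoded are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "feature_1": list("abcde"),
    "feature_2": list("ecdba"),
})

# Each crosstab is an id-by-category count table (0 or 1 here).
first = pd.crosstab(df["id"], df["feature_1"])
second = pd.crosstab(df["id"], df["feature_2"])

# add() aligns on row and column labels; fill_value=0 covers a category
# that occurs in only one of the two feature columns.
encoded = first.add(second * 0.5, fill_value=0)
print(encoded)
```

With a plain +, a category missing from one of the crosstabs would produce NaN for that whole column, which is why the sketch uses add here.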

If you have more than two features, each with a defined weight, you can melt the frame to long format, map each feature name to its weight, and then sum the weights per id and value:

weights = {'feature_1':1, 'feature_2':0.5}
flatten = df.melt('id')

(flatten['variable'].map(weights)
     .groupby([flatten['id'], flatten['value']])
     .sum().unstack('value', fill_value=0)
)
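
A possible way to package the melt-and-map approach is a small helper like the one below. This is only a sketch: weighted_one_hot and its id_col parameter are hypothetical names, not part of pandas, and the sample frame is the one from the question.

```python
import pandas as pd


def weighted_one_hot(df, weights, id_col="id"):
    """Hypothetical helper: encode each feature column with its own weight."""
    # Long format: one row per (id, feature name, category value).
    long = df.melt(id_vars=id_col, value_vars=list(weights))
    # Replace each feature name with the weight assigned to it.
    long["weight"] = long["variable"].map(weights)
    # Sum the weights per (id, category), then spread categories into columns.
    return (
        long.groupby([id_col, "value"])["weight"]
        .sum()
        .unstack("value", fill_value=0)
    )


# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "feature_1": list("abcde"),
    "feature_2": list("ecdba"),
})

encoded = weighted_one_hot(df, {"feature_1": 1, "feature_2": 0.5})
print(encoded)
```

Adding a third feature should then only require another column in df and another entry in the weights dict. Note that if the same value appears in two features of one row, its weights are summed.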