I have data like this:
id | feature_1 | feature_2 |
---|---|---|
1 | a | e |
2 | b | c |
3 | c | d |
4 | d | b |
5 | e | a |
I want a one-hot-style encoding in which the value in the first feature column is encoded as 1 and the value in the second feature column as 0.5, as in the following table.
id | a | b | c | d | e |
---|---|---|---|---|---|
1 | 1 | 0 | 0 | 0 | 0.5 |
2 | 0 | 1 | 0.5 | 0 | 0 |
3 | 0 | 0 | 1 | 0.5 | 0 |
4 | 0 | 0.5 | 0 | 1 | 0 |
5 | 0.5 | 0 | 0 | 0 | 1 |
But when I apply sklearn.preprocessing.OneHotEncoder, it outputs 10 columns (one per value in each feature column), with a 1 in each matching position.
How can I get the table above instead?
CodePudding user response:
For two columns, you can add the two crosstabs, scaling the second by 0.5:
pd.crosstab(df['id'], df['feature_1']) + pd.crosstab(df['id'], df['feature_2']) * .5
Output:
feature_1 a b c d e
id
1 1.0 0.0 0.0 0.0 0.5
2 0.0 1.0 0.5 0.0 0.0
3 0.0 0.0 1.0 0.5 0.0
4 0.0 0.5 0.0 1.0 0.0
5 0.5 0.0 0.0 0.0 1.0
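For reference, here is a self-contained sketch of the crosstab approach that rebuilds the sample data from the question. One change from the one-liner above is an assumption on my part: it uses DataFrame.add with fill_value=0 instead of a plain +, so that a category appearing in only one of the two feature columns gets 0 rather than NaN. With the sample data both columns share the same categories, so the result is the same.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "feature_1": list("abcde"),
    "feature_2": list("ecdba"),
})

# One indicator table per feature column; the second is scaled by its weight
first = pd.crosstab(df["id"], df["feature_1"])
second = pd.crosstab(df["id"], df["feature_2"]) * 0.5

# fill_value=0 treats a category missing from one table as 0 before adding
encoded = first.add(second, fill_value=0)
print(encoded)
```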
If you have more than two features, each with a defined weight, you can melt the frame to long format,
map each feature column to its weight, and sum the weights per id and value:
weights = {'feature_1':1, 'feature_2':0.5}
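# Long format: one row per (id, feature column, value)
flatten = df.melt('id')
# Sum the feature weights per (id, value), then spread the values into columns
(flatten['variable'].map(weights)
   .groupby([flatten['id'], flatten['value']])
   .sum().unstack('value', fill_value=0)
)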
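The melt approach can be wrapped in a small helper. This is only a sketch: the function name weighted_one_hot and its id_col parameter are hypothetical, and it assumes the frame has no existing column named "variable", "value", or "weight".

```python
import pandas as pd

def weighted_one_hot(df, weights, id_col="id"):
    """Encode the feature columns named in `weights` as weighted indicator columns."""
    # Long format: one row per (id, feature column, value)
    long = df.melt(id_vars=id_col, value_vars=list(weights))
    # Replace each feature column name with its weight
    long["weight"] = long["variable"].map(weights)
    # Sum weights per (id, value) and spread the values into columns
    return (long.groupby([id_col, "value"])["weight"]
                .sum()
                .unstack("value", fill_value=0))

# Sample data and weights from the question and answer
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "feature_1": list("abcde"),
    "feature_2": list("ecdba"),
})
encoded = weighted_one_hot(df, {"feature_1": 1, "feature_2": 0.5})
print(encoded)
```

Because the weights are summed, a row that has the same value in both feature columns would get 1.5 in that column; if you would rather keep the larger weight, swap sum() for max().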