How to rank the categorical values while one-hot encoding

I have data like this:

id	feature_1	feature_2
1	a	e
2	b	c
3	c	d
4	d	b
5	e	a

I want a one-hot-style encoding in which a value from the first feature column is represented by 1 and a value from the second feature column by 0.5, as in the following table.

id	a	b	c	d	e
1	1	0	0	0	0.5
2	0	1	0.5	0	0
3	0	0	1	0.5	0
4	0	0.5	0	1	0
5	0.5	0	0	0	1

But sklearn.preprocessing.OneHotEncoder outputs 10 columns (one per feature/value pair), each containing only 0s and 1s.

How can I achieve this?

CodePudding user response：

For the two columns, you can do:

pd.crosstab(df['id'], df['feature_1']) + pd.crosstab(df['id'], df['feature_2']) * .5

Output:

feature_1    a    b    c    d    e
id                                
1          1.0  0.0  0.0  0.0  0.5
2          0.0  1.0  0.5  0.0  0.0
3          0.0  0.0  1.0  0.5  0.0
4          0.0  0.5  0.0  1.0  0.0
5          0.5  0.0  0.0  0.0  1.0
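
For reference, a self-contained version of the two-column approach might look like the sketch below. The DataFrame is rebuilt from the question's table, and the names first, second and encoded are only illustrative. DataFrame.add with fill_value=0 is used for the sum, on the assumption that a category appearing in only one feature column should get 0 rather than NaN.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "feature_1": list("abcde"),
    "feature_2": list("ecdba"),
})

# One indicator table per feature; the second is scaled by its weight.
first = pd.crosstab(df["id"], df["feature_1"])
second = pd.crosstab(df["id"], df["feature_2"]) * 0.5

# Align on id and category, treating a column missing from one table as 0.
encoded = first.add(second, fill_value=0)
print(encoded)
```

With the sample data both features share the same five categories, so the fill value never comes into play; it only matters when the category sets differ.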

If you have more than two features, define a weight for each one, then melt the frame and map each feature name to its weight:

weights = {'feature_1':1, 'feature_2':0.5}
flatten = df.melt('id')

(flatten['variable'].map(weights)
     .groupby([flatten['id'], flatten['value']])
     .sum().unstack('value', fill_value=0)
)
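
The melt-and-map idea can also be wrapped in a small helper, sketched here under a few assumptions: the function name weighted_one_hot and its parameters are hypothetical, the data is the question's sample table, and only the columns listed in the weights dict are encoded.

```python
import pandas as pd

def weighted_one_hot(df, id_col, weights):
    """Hypothetical helper: encode each column named in `weights` with its weight."""
    # Long format: one row per (id, feature name, category value).
    long = df.melt(id_vars=id_col, value_vars=list(weights))
    # Replace each feature name with its weight.
    long["weight"] = long["variable"].map(weights)
    # Sum the weights per (id, category) and spread categories into columns.
    return (long.groupby([id_col, "value"])["weight"]
                .sum()
                .unstack("value", fill_value=0))

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "feature_1": list("abcde"),
    "feature_2": list("ecdba"),
})

result = weighted_one_hot(df, "id", {"feature_1": 1, "feature_2": 0.5})
print(result)
```

Because the weights are summed, a category that appears in several feature columns for the same id gets the total of their weights (1.5 if it were in both columns here). Supporting a further feature column should only require adding it to the dict with its weight.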