How to create a one-hot-encoding for the intermediate class?-CodePudding

Let's say I have 3 classes: 0, 1, 2
One-hot-encoding an array of labels can be done via pandas as follows:

What I'm interested in, is how to get an encoding that can handle an intermediate class, e.g. class in the middle between 2 classes.
For example:

for class 0.4, resulting encoding should be [0.4, 0.6, 0]
for class 1.8, resulting encoding should be [0, 0.2, 0.8]

Does anybody know such an encoder?
Thanks for your answer!

CodePudding user response：

You can write a function for your strange encoding like the below:

import numpy as np
import math

def strange_encode(num, cnt_lbl):
    encode_arr = np.zeros(cnt_lbl)
    lbl = math.floor(num)
    if lbl != num:
        num -= lbl
        if num >= 0.5:
            encode_arr[lbl:lbl 2] = [1-num, num]
        else:
            encode_arr[lbl:lbl 2] = [num, 1-num]
    else:
        encode_arr[lbl] = 1
    return encode_arr

Output:

>>> encode(0.0, cnt_lbl=3)
array([1., 0., 0.])

>>> encode(2.0, cnt_lbl=3)
array([0., 0., 1.])

>>> encode(0.4, cnt_lbl=3)
array([0.4, 0.6, 0. ])

>>> encode(1.8, cnt_lbl=3)
array([0. , 0.2, 0.8])

# You can change the count of classes
>>> encode(2.5, cnt_lbl=4)
array([0. , 0. , 0.5, 0.5])

>>> encode(1.6, cnt_lbl=4)
array([0. , 0.4, 0.6, 0. ])

>>> encode(2, cnt_lbl=4)
array([0., 0., 1., 0.])

We can write a function for generating a dataframe for encoding like below:

import pandas as pd
def generate_df_encoding(arr_nums, num_classes):
    arr = np.zeros((len(arr_nums), num_classes))
    for idx, num in enumerate(arr_nums):
        arr[idx] = strange_encode(num, cnt_lbl=num_classes)
    return pd.DataFrame(arr)

Output:

>>> generate_df_encoding([0,0,1,2,0.4,1.8], num_classes=3)

    0    1      2
0   1.0  0.0    0.0
1   1.0  0.0    0.0
2   0.0  1.0    0.0
3   0.0  0.0    1.0
4   0.4  0.6    0.0
5   0.0  0.2    0.8

CodePudding user response：

I'm not sure if this helps.

When you have categorial variables (classes), without order, that need to be encoded, you can use sklearn.preprocessing.LabelEncoder(). If there's an order, try using sklearn.preprocessing.OrdinalEncoder().

When you have numeric variables, the preprocessing steps I would use would be one of the many preprocessing tools you can find at https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing ; a couple widely used are MinMaxScaler(), RobustScaler(), StandardScaler(), but it's best you read the API Reference guide to better understand when to use one or the other.