How to do one-hot encoding of two similar columns into one?

Time:09-10

I have a dataframe in which several columns can hold the same categorical values, and I'd like to convert these values into binary indicator columns. I have been trying to use pd.get_dummies() to achieve this, but I end up with many repetitive columns (e.g. Color1_green and Color2_green).

An example dataframe of my input would be something like:

        User         Color1       Color2       Color3
0      Username1      green        red          blue
1      Username2      red          blue         NaN
2      Username3      green        yellow       NaN

As you can see, the columns Color1, Color2 and Color3 share the same set of possible values, and a value never repeats within a row (so if Color1 is red, Color2 cannot be red).
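For reproducibility, the example frame above could be built as in the sketch below. It assumes the missing colors are stored as NaN (here `np.nan`), and the names `df` and `naive` are only illustrative. It also shows how a plain `pd.get_dummies()` call produces one dummy per source column, so the same color ends up in several columns.

```python
import numpy as np
import pandas as pd

# Example input; missing colors are assumed to be NaN.
df = pd.DataFrame({
    "User": ["Username1", "Username2", "Username3"],
    "Color1": ["green", "red", "green"],
    "Color2": ["red", "blue", "yellow"],
    "Color3": ["blue", np.nan, np.nan],
})

# Encoding each column separately prefixes every dummy with its source
# column, so one color can appear more than once
# (in this sample: Color1_red and Color2_red).
naive = pd.get_dummies(df, columns=["Color1", "Color2", "Color3"])
print(list(naive.columns))
```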

What I'm trying to achieve is performing a one-hot encoding on these three color columns in order to get the following result:

        User          green       red       blue      yellow
0      Username1        1          1         1           0
1      Username2        0          1         1           0
2      Username3        1          0         0           1

Is there some way to do this type of one-hot encoding using pandas?

Answer:

You can stack the color columns into a single Series, apply get_dummies to it, and then aggregate per original row with max:

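import pandas as pd

# Stack the Color* columns, one-hot encode the result, then collapse
# back to one row per original index; dtype=int gives 0/1, not booleans.
out = df[['User']].join(
    pd.get_dummies(df.filter(like='Color').stack(), dtype=int)
      .groupby(level=0).max()
)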

Output:

        User  blue  green  red  yellow
0  Username1     1      1    1       0
1  Username2     1      0    1       0
2  Username3     0      1    0       1
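The same pipeline can be split into its individual steps, which may make it easier to see what each call contributes. This is a minimal sketch rather than a definitive implementation: the intermediate names (`stacked`, `dummies`, `encoded`) are illustrative, the input frame is rebuilt from the question with NaN assumed for missing colors, and the explicit `dropna()` and `dtype=int` are additions meant to keep the result independent of pandas-version defaults.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "User": ["Username1", "Username2", "Username3"],
    "Color1": ["green", "red", "green"],
    "Color2": ["red", "blue", "yellow"],
    "Color3": ["blue", np.nan, np.nan],
})

# 1. Select the Color* columns and stack them into one long Series
#    indexed by (original row, source column). Missing colors are
#    dropped explicitly.
stacked = df.filter(like="Color").stack().dropna()

# 2. One-hot encode the long Series: one column per distinct color.
#    dtype=int requests 0/1 integers instead of booleans.
dummies = pd.get_dummies(stacked, dtype=int)

# 3. Collapse back to one row per original index. max yields 1 if any
#    of the color columns in that row held the value.
encoded = dummies.groupby(level=0).max()

# 4. Re-attach the User column.
out = df[["User"]].join(encoded)
print(out)
```

One caveat with this sketch: a row whose color columns are all missing would have no entry in `encoded`, so the join would leave NaN in its indicator columns, and a `fillna(0)` might be needed in that case.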