I have a dataframe in which several columns can hold the same values (categorical variables), and I'd like to transform these values into binary numerical indicators. I have been trying to use the pd.get_dummies() function to achieve this, but I end up with lots of repetitive columns (e.g. Color1_green and Color2_green).
An example dataframe of my input would be something like:
User Color1 Color2 Color3
0 Username1 green red blue
1 Username2 red blue NaN
2 Username3 green yellow NaN
As you can see, the variables Color1, Color2 and Color3 hold the same possible values, and a value never repeats within a row (so if Color1 is red, Color2 cannot be red).
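For reference, here is a minimal sketch that reproduces the problem. The variable names df and naive are only assumptions for illustration, and the frame is built by hand from the example above:

```python
import numpy as np
import pandas as pd

# Example input, typed in by hand from the table above
df = pd.DataFrame({
    "User": ["Username1", "Username2", "Username3"],
    "Color1": ["green", "red", "green"],
    "Color2": ["red", "blue", "yellow"],
    "Color3": ["blue", np.nan, np.nan],
})

# Encoding each column separately prefixes every indicator with its source
# column, so the same color shows up several times (Color1_red, Color2_red, ...)
naive = pd.get_dummies(df, columns=["Color1", "Color2", "Color3"], dtype=int)
print(naive.columns.tolist())
```

Each color gets one indicator per source column, which is the repetition I want to avoid.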
What I'm trying to achieve is performing a one-hot encoding on these three color columns in order to get the following result:
User green red blue yellow
0 Username1 1 1 1 0
1 Username2 0 1 1 0
2 Username3 1 0 0 1
Is there some way to do this type of one-hot encoding using pandas?
Answer:
You can stack the color columns into a single Series, apply get_dummies to it, and then aggregate with max per original row:
out = df[['User']].join(
    # dtype=int yields 0/1; recent pandas versions return booleans by default
    pd.get_dummies(df.filter(like='Color').stack(), dtype=int)
      .groupby(level=0).max()
)
Output:
User blue green red yellow
0 Username1 1 1 1 0
1 Username2 1 0 1 0
2 Username3 0 1 0 1
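As a self-contained sketch of the same idea, broken into steps: the input frame below is rebuilt by hand from the question, and the intermediate names (colors, stacked, dummies, per_row) are only illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Input frame reconstructed from the question's example
df = pd.DataFrame({
    "User": ["Username1", "Username2", "Username3"],
    "Color1": ["green", "red", "green"],
    "Color2": ["red", "blue", "yellow"],
    "Color3": ["blue", np.nan, np.nan],
})

# 1. Keep only the columns whose name contains "Color"
colors = df.filter(like="Color")

# 2. Reshape to one long Series indexed by (original row, source column)
stacked = colors.stack()

# 3. One indicator column per distinct color; missing values get no column
dummies = pd.get_dummies(stacked, dtype=int)

# 4. Collapse back to one row per original row: max is 1 if the color
#    appeared in any of that row's Color columns
per_row = dummies.groupby(level=0).max()

# 5. Re-attach the User column by index
out = df[["User"]].join(per_row)
print(out)
```

The indicator columns come out in alphabetical order. If the order from the question matters, selecting them explicitly, e.g. out[['User', 'green', 'red', 'blue', 'yellow']], should rearrange them.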