I have a DataFrame column with 3 values: Bart, Peg, and Human. I need to one-hot encode them such that Bart and Peg stay as columns and Human is represented as 0 0.
Xi | Architecture
0 | Bart
1 | Bart
2 | Peg
3 | Human
4 | Human
5 | Peg
...
I want to one-hot encode them so that Human is represented as 0 0:
Xi |Bart| Peg
0 | 1 | 0
1 | 1 | 0
2 | 0 | 1
3 | 0 | 0
4 | 0 | 0
5 | 0 | 1
But when I do:
pd.get_dummies(df['Architecture'], drop_first = True)
it removes "Bart" and keeps the other 2. Is there a way to specify which column to remove?
CodePudding user response:
You could mask "Human" (replace it with NaN) before encoding:
df = df[['Xi']].join(pd.get_dummies(df['Architecture'].mask(df['Architecture']=='Human')))
Output:
Xi Bart Peg
0 0 1 0
1 1 1 0
2 2 0 1
3 3 0 0
4 4 0 0
5 5 0 1
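This works because get_dummies, with dummy_na left at its default of False, gives NaN rows a 0 in every indicator column. A self-contained sketch of the same idea, with the sample frame rebuilt from the question and dtype=int added only so the result prints as 0/1 (newer pandas versions return booleans by default):

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "Xi": range(6),
    "Architecture": ["Bart", "Bart", "Peg", "Human", "Human", "Peg"],
})

# Replace "Human" with NaN; every other value is left untouched.
masked = df["Architecture"].mask(df["Architecture"] == "Human")

# NaN rows end up with 0 in every indicator column, so no "Human" column appears.
encoded = df[["Xi"]].join(pd.get_dummies(masked, dtype=int))
print(encoded)
```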
CodePudding user response:
IIUC, try using get_dummies and then dropping the 'Human' column:
df['Architecture'].str.get_dummies().drop('Human', axis=1)
Output:
Bart Peg
0 1 0
1 1 0
2 0 1
3 0 0
4 0 0
5 0 1
CodePudding user response:
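It's dropping "Bart" because that's the "first" label. For a plain string column, get_dummies orders the labels alphabetically, not by the order they appear in the data.
get_dummies doesn't have a built-in way to say "drop this particular column". It is annoying.
So you can do a few things:
- make "Human" the first label before using get_dummies, so it is the one removed when you use drop_first. Sorting the rows is not enough, because the labels are ordered alphabetically; convert the column to a categorical with "Human" listed first instead
- subset the dataset so you only one-hot encode the rows where Architecture is "Bart" or "Peg", then fill the remaining rows with 0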
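A minimal sketch of getting "Human" to be the label that drop_first removes. For a categorical column, get_dummies follows the category order, so listing "Human" first is enough. The sample frame is rebuilt from the question, and the names order, arch and result are made up for this example:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "Xi": range(6),
    "Architecture": ["Bart", "Bart", "Peg", "Human", "Human", "Peg"],
})

# List "Human" first so that it is the level drop_first removes.
order = ["Human", "Bart", "Peg"]
arch = pd.Categorical(df["Architecture"], categories=order)

# dtype=int is only there so the result prints as 0/1.
dummies = pd.get_dummies(arch, drop_first=True, dtype=int)

# Use plain string column labels instead of a categorical index.
dummies.columns = dummies.columns.astype(str)

result = df[["Xi"]].join(dummies)
print(result)
```

If the column has many labels, the drop-the-column approach from the other answer is probably simpler; this version is mainly useful when you want to keep using drop_first.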