One Hot Encoder on columns

Time:12-02

The data available is as follows:

    bread   milk    butter  jam  nutella    cheese  chips
0   bread   NaN     butter  jam  nutella    NaN     NaN
1   NaN     NaN     butter  jam  nutella    NaN     chips
2   NaN     milk    NaN     NaN  NaN        cheese  NaN
3   bread   milk    butter  jam  nutella    cheese  chips
4   bread   milk    NaN     NaN  nutella    NaN     NaN
5   bread   milk    butter  jam  NaN        cheese  chips
6   bread   milk    NaN     NaN  nutella    NaN     NaN
7   bread   NaN     butter  NaN  NaN        cheese  NaN
8   bread   NaN     butter  jam  nutella    NaN     NaN
9   NaN     milk    butter  jam  NaN        cheese  NaN
10  bread   NaN     NaN     jam  nutella    cheese  chips
11  bread   milk    butter  jam  nutella    NaN     NaN
12  bread   NaN     butter  NaN  nutella    cheese  NaN
13  bread   NaN     butter  jam  nutella    cheese  chips
14  bread   milk    butter  jam  nutella    cheese  chips
15  NaN     milk    butter  jam  nutella    cheese  NaN
16  NaN     milk    NaN     jam  nutella    cheese  NaN
17  bread   milk    butter  jam  nutella    cheese  chips
18  bread   NaN     butter  jam  nutella    cheese  NaN
19  bread   milk    butter  NaN  nutella    cheese  NaN
20  NaN     milk    NaN     NaN  NaN        NaN     chips

I want to one-hot encode every column so that the entire dataset looks like this (first two rows shown):

bread milk butter jam nutella cheese chips
1 0 1 1 1 0 0
0 0 1 1 1 0 1

Can someone please help me with the code?

So I tried to use the following code:

pd.get_dummies(book_data, columns = ['bread', 'milk','butter','jam', 'nutella','cheese','chips'])

I obtained the following error:

KeyError: "['bread', 'cheese'] not in index"

CodePudding user response:

You can use pandas.Series.name as a trick: within each column, replace the value that matches the column name with 1, then turn the remaining NaN values into 0 with fillna(0).

First, clean up the column names, since the KeyError suggests that some of them contain stray whitespace:

book_data.columns= book_data.columns.str.strip()

You can also strip stray whitespace from the cell values:

book_data = book_data.replace(r"\s+", "", regex=True)
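As a minimal sketch of what the two cleanup steps do, the snippet below builds a small hypothetical frame (the name raw and its padded headers and values are assumptions for illustration, not the asker's actual file) and applies the same calls:

```python
import pandas as pd

# Hypothetical frame: headers and values carry stray spaces, which is
# one way a lookup such as columns=['bread', ...] can fail with a KeyError.
raw = pd.DataFrame({
    " bread": ["bread ", None, "bread"],
    "milk": [None, " milk", "milk"],
    "cheese ": [None, "cheese", None],
})

# Printing the labels as a list makes the padding visible.
print(list(raw.columns))  # [' bread', 'milk', 'cheese ']

cleaned = raw.copy()
# Strip leading/trailing whitespace from the column labels.
cleaned.columns = cleaned.columns.str.strip()
# Remove whitespace inside the string values; missing cells are left alone.
cleaned = cleaned.replace(r"\s+", "", regex=True)

print(list(cleaned.columns))
```

Printing list(book_data.columns) on the real data is a quick way to check whether padded labels are what triggers the error.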

Then try this:

out = book_data.apply(lambda x: x.replace(x.name, 1), axis=0).fillna(0).astype(int)

# Output :

print(out)

    bread  milk  butter  jam  nutella  cheese  chips
0       1     0       1    1        1       0      0
1       0     0       1    1        1       0      1
2       0     1       0    0        0       1      0
3       1     1       1    1        1       1      1
4       1     1       0    0        1       0      0
5       1     1       1    1        0       1      1
6       1     1       0    0        1       0      0
7       1     0       1    0        0       1      0
8       1     0       1    1        1       0      0
9       0     1       1    1        0       1      0
10      1     0       0    1        1       1      1
11      1     1       1    1        1       0      0
12      1     0       1    0        1       1      0
13      1     0       1    1        1       1      1
14      1     1       1    1        1       1      1
15      0     1       1    1        1       1      0
16      0     1       0    1        1       1      0
17      1     1       1    1        1       1      1
18      1     0       1    1        1       1      0
19      1     1       1    0        1       1      0
20      0     1       0    0        0       0      1
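Since every non-missing cell in this dataset simply repeats its column name, an alternative is to encode presence directly. This is a sketch of a different technique from the replace-based one above, using a hypothetical three-row sample (the first three rows of the bread, milk and butter columns); it assumes any non-missing value should count as 1:

```python
import pandas as pd

# Hypothetical sample mirroring the first three rows of the question's data.
sample = pd.DataFrame({
    "bread": ["bread", None, None],
    "milk": [None, None, "milk"],
    "butter": ["butter", "butter", None],
})

# notna() gives True where a value is present; astype(int) maps True/False to 1/0.
encoded = sample.notna().astype(int)
print(encoded)
```

Unlike the replace-based approach, this does not check that the cell actually equals the column name, so it only fits data where each column holds either its own name or a missing value.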