One hot encoding when a column has more than one label

Time:06-22

I'm looking to convert the dataset below, where a student can appear on several rows,

-----------------------
|   Name   | Subject  |
-----------------------
| Student1 | Subject1 |
| Student1 | Subject2 |
| Student2 | Subject1 |
| Student3 | Subject3 |
| Student3 | Subject4 |
| Student3 | Subject5 |
| Student4 | Subject6 |
-----------------------

to this

-------------------------------------------------------------------------------
|    Name   | Subject1 | Subject2 | Subject3 | Subject4 | Subject5 | Subject6 |
-------------------------------------------------------------------------------
|  Student1 |     1    |     1    |     0    |     0    |     0    |     0    |
|  Student2 |     1    |     0    |     0    |     0    |     0    |     0    |
|  Student3 |     0    |     0    |     1    |     1    |     1    |     0    |
|  Student4 |     0    |     0    |     0    |     0    |     0    |     1    |
-------------------------------------------------------------------------------

Is there a one-liner that can do this on a pandas DataFrame?

CodePudding user response:

You can try pd.crosstab, which counts how often each Name/Subject pair occurs:

import pandas as pd
# df is the DataFrame from the question, with columns 'Name' and 'Subject'
out = pd.crosstab(df['Name'], df['Subject'])
print(out)

Subject   Subject1  Subject2  Subject3  Subject4  Subject5  Subject6
Name
Student1         1         1         0         0         0         0
Student2         1         0         0         0         0         0
Student3         0         0         1         1         1         0
Student4         0         0         0         0         0         1

CodePudding user response:

You can use pd.crosstab and then call reset_index() so that Name becomes a regular column again:

pd.crosstab(df.Name, df.Subject).reset_index()
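A small follow-up sketch, in case the leftover "Subject" label on the column axis gets in the way: after reset_index() the columns still carry that axis name, and rename_axis can clear it. The variable name flat is just for illustration, and the sample data is rebuilt from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Student1", "Student1", "Student2", "Student3",
             "Student3", "Student3", "Student4"],
    "Subject": ["Subject1", "Subject2", "Subject1", "Subject3",
                "Subject4", "Subject5", "Subject6"],
})

flat = (
    pd.crosstab(df["Name"], df["Subject"])
    .reset_index()                      # move Name from the index to a column
    .rename_axis(None, axis="columns")  # drop the "Subject" axis label
)
print(flat)
```

This matches the layout asked for in the question, with Name as the first column followed by one column per subject.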