I have to perform one-hot encoding on the data frames that contain the survey results.
In this data frames, they represented 'multiple response' in multiple columns. for example, if someone choose '1' and '3', it is represented by '1'in first columns and '3' in second columns.
I need to organize the data like this way
Quesion1 | Quesion1 |
---|---|
1 | 3 |
2 | 4 |
1 | 5 |
-->
1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|
1 | 0 | 1 | 0 | 0 |
0 | 1 | 0 | 1 | 0 |
1 | 0 | 0 | 0 | 1 |
I tried to use 'get_dummies' to do one-hot encoding. but two columns are treated as different question. ('2' in first question and '2' in second question are treated like different values)
Is there any good solution to solve this problem?
CodePudding user response:
>>> pd.get_dummies(df.stack()).groupby(level=0).sum().clip(upper=1)
1 2 3 4 5
0 1 0 1 0 0
1 0 1 0 1 0
2 1 0 0 0 1
- flatten the frame with
stack
- get the dummies
- groupby each row (now at level=0)
- sum the 1s but max-clip at 1 for possible multiple occurences (i.e., same answers for questions)
CodePudding user response:
s = df.stack().droplevel(1)
out = pd.crosstab(s.index, s).rename_axis(index=None, columns=None)
Output:
1 2 3 4 5
0 1 0 1 0 0
1 0 1 0 1 0
2 1 0 0 0 1
CodePudding user response:
You need to use .get_dummies()
to converto to one-hot for each column and then join them using pd.concat()
:
df2 = pd.concat([pd.get_dummies(df.loc[:, column]) for column in df.columns], axis=1)
print(df2)
Output:
1 2 3 4 5
0 1 0 1 0 0
1 0 1 0 1 0
2 1 0 0 0 1