Home > Software design >  python pandas one- hot encoding in several columns for the same question
python pandas one- hot encoding in several columns for the same question

Time:01-10

I have to perform one-hot encoding on the data frames that contain the survey results.

In this data frames, they represented 'multiple response' in multiple columns. for example, if someone choose '1' and '3', it is represented by '1'in first columns and '3' in second columns.

I need to organize the data like this way

Quesion1 Quesion1
1 3
2 4
1 5

-->

1 2 3 4 5
1 0 1 0 0
0 1 0 1 0
1 0 0 0 1

I tried to use 'get_dummies' to do one-hot encoding. but two columns are treated as different question. ('2' in first question and '2' in second question are treated like different values)

Is there any good solution to solve this problem?

CodePudding user response:

>>> pd.get_dummies(df.stack()).groupby(level=0).sum().clip(upper=1)

   1  2  3  4  5
0  1  0  1  0  0
1  0  1  0  1  0
2  1  0  0  0  1
  • flatten the frame with stack
  • get the dummies
  • groupby each row (now at level=0)
  • sum the 1s but max-clip at 1 for possible multiple occurences (i.e., same answers for questions)

CodePudding user response:

Using stack and crosstab:

s = df.stack().droplevel(1)

out = pd.crosstab(s.index, s).rename_axis(index=None, columns=None)

Output:

   1  2  3  4  5
0  1  0  1  0  0
1  0  1  0  1  0
2  1  0  0  0  1

CodePudding user response:

You need to use .get_dummies() to converto to one-hot for each column and then join them using pd.concat():

df2 = pd.concat([pd.get_dummies(df.loc[:, column]) for column in df.columns], axis=1)
print(df2)

Output:

   1  2  3  4  5
0  1  0  1  0  0
1  0  1  0  1  0
2  1  0  0  0  1
  • Related