How to get One-Hot encoded matrix from a survey table and vector of answers

I have the participants' survey answers in a pandas DataFrame:

[['A', 'B', 'C', 'A' ...],
 ['D', 'B', 'B', 'A' ...],
 ......................
 ['D', 'C', 'C', 'A' ...]]

and I have a vector with the answer key for the survey:

['D', 'B', 'B', 'A' ...]

I need to get a DataFrame that shows, as boolean values, whether each answer matches the key, like:

[[0, 1, 0, 1 ...],
 [1, 1, 1, 1 ...],
 ......................
 [1, 0, 0, 1 ...]]

I've tried pd.get_dummies(users_answ, keys), but that seems wrong.

CodePudding user response：

You should be able to simply check the equality between the DataFrame and the list. The list should get aligned to the DataFrame across the columns:

import pandas as pd

df = pd.DataFrame([[*'ABCA'],[*'DBBA'],[*'DCCA']])
keys = [*'DBBA']

print(df)
   0  1  2  3
0  A  B  C  A
1  D  B  B  A
2  D  C  C  A

print(keys)
['D', 'B', 'B', 'A']

print(df == keys)
       0      1      2     3
0  False   True  False  True
1   True   True   True  True
2   True  False  False  True

# If you want actual integers instead of booleans
print((df == keys).astype(int))
   0  1  2  3
0  0  1  0  1
1  1  1  1  1
2  1  0  0  1
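If the key is kept as a Series labeled by question instead of a plain list, the alignment mentioned above happens by column label rather than by position. Here is a rough sketch of that, assuming made-up question labels q1..q4 (they are not from the original post) and using .eq, which aligns the Series index with the DataFrame columns:

```python
import pandas as pd

# Hypothetical column labels, used only for illustration
df = pd.DataFrame(
    [list("ABCA"), list("DBBA"), list("DCCA")],
    columns=["q1", "q2", "q3", "q4"],
)

# Key stored in a different order than the columns; labels decide the match
keys = pd.Series({"q4": "A", "q1": "D", "q3": "B", "q2": "B"})

# .eq aligns the Series index with the DataFrame columns before comparing
checked = df.eq(keys, axis=1).astype(int)
print(checked)
```

With a plain list there are no labels, so the match is purely positional and the list must have one entry per column.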

CodePudding user response：

The easiest way seems to be the pandas eq function: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eq.html#pandas.DataFrame.eq

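So the whole solution is a one-liner. Note that axis=1 (the default) matches each key to a column; axis=0 would try to match the keys to the rows and fails unless the number of keys equals the number of rows:

users_answ.eq(keys, axis=1)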
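For a quick check, here is a minimal runnable sketch of the eq approach. The sample data is assumed (borrowed from the small example in the other answer), with one row per participant and one column per question. Since there is one key per column, the comparison runs along axis=1, which is also the default:

```python
import pandas as pd

# Assumed sample data: rows are participants, columns are questions
users_answ = pd.DataFrame([list("ABCA"), list("DBBA"), list("DCCA")])
keys = list("DBBA")  # one correct answer per question (column)

# Compare every row against the key, then turn True/False into 1/0
checked = users_answ.eq(keys, axis=1).astype(int)
print(checked)
```

Dropping .astype(int) leaves the result as booleans, which is what the question literally asks for.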

Alternative solution:

# list that will hold one 0/1 row per participant
checked_answ = []
# go through each row of the survey answers DataFrame
for r in range(users_answ.shape[0]):
    row = users_answ.iloc[r].tolist()
    # 1 where the answer matches the key for that question, else 0
    p = []
    for i in range(len(keys)):
        if keys[i] == row[i]:
            p.append(1)
        else:
            p.append(0)
    checked_answ.append(p)
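The loop leaves checked_answ as a plain list of lists, while the question asks for a DataFrame. A shorter variant of the same idea, sketched here with assumed sample data, builds the rows with a comprehension and wraps them in a DataFrame that reuses the original index and columns:

```python
import pandas as pd

# Assumed sample data, same shape as in the question
users_answ = pd.DataFrame([list("ABCA"), list("DBBA"), list("DCCA")])
keys = list("DBBA")

# Pair each answer with its key; int(True) is 1 and int(False) is 0
rows = [
    [int(answer == key) for answer, key in zip(row, keys)]
    for row in users_answ.to_numpy().tolist()
]

# Keep the original row and column labels on the result
checked_answ = pd.DataFrame(rows, index=users_answ.index, columns=users_answ.columns)
print(checked_answ)
```

For large tables the eq version is likely the better choice, since it avoids looping in Python.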