a client's online form outputs multi-choice selections to a single field separated by commas. is there a way to split out the values into columns and one hot encode them, I'm using pandas?
Data now:
id | ind
1 | Student, Tourism, Industrial
2 | Residential, Student, Tourism
3 | student, Tourism, Industrial, Education
How I want the data:
id | ind_student| ind_Tourism| ind_Industrial| ind_Residential| ind_Education
1 | 1 | 1 | 1 | 0 | 0
2 | 1 | 1 | 0 | 1 | 0
3 | 1 | 1 | 1 | 0 | 1
thanks for taking a look!
CodePudding user response:
Split your ind
columns into words and pivot your dataframe:
out = df.assign(ind=df['ind'].str.split(', '), dummy=1).explode('ind') \
.pivot_table('dummy', 'id', 'ind', fill_value=0).add_prefix('ind_')
print(out)
# Output:
ind ind_Education ind_Industrial ind_Residential ind_Student ind_Tourism ind_student
id
1 0 1 0 1 1 0
2 0 0 1 1 1 0
3 1 1 0 0 1 1
CodePudding user response:
Use get_dummies. However, first put each string in its row by creating list and exploding. Because there is mixed case, please capitalise the first letter and then crosstab
df=df.assign(ind=df['ind'].str.title().str.strip().str.split(',')).explode('ind')
df=pd.crosstab(df['id'], df['ind']).add_prefix('ind_')
ind ind_Education ind_Industrial ind_Residential ind_Student ind_Tourism
id
1 0 1 0 1 1
2 0 0 1 1 1
3 1 1 0 1 1