I have a dataframe with columns that contain comma-separated strings. I would like to create new columns, similar to what one-hot encoding does.
Below is a very simplistic example. In my use case, I have thousands of rows and more columns, two of which contain many comma-separated strings. I could have used apply with a lambda and a string-contains condition to create each column, but that is very tedious as it would mean hundreds of new columns.
Input Dataframe
ColumnA ColumnB
1 {"alpha", "bravo"}
2 {"bravo", "charlie"}
3 {"alpha", "charlie","gama"}
4 {"bravo", "charlie","delta"}
Output dataframe
ColumnA alpha bravo charlie delta gama
1 1 1 0 0 0
2 0 1 1 0 0
3 1 0 1 0 1
4 0 1 1 1 0
Edit: I should have mentioned that the data is read from a CSV file. Column B is of type str and contains many strings inside; it looks like {"value1" , "value2", ..., "valueN"}, where the values could be of the format "XYZ", "X Y Z" or "X_Y_Z". The example I have provided is a very simplistic version.
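For reference, here is a minimal sketch that rebuilds the simplistic example above in pandas (ColumnB stored as plain strings, which is also what it would be after pd.read_csv):
import pandas as pd

# ColumnB holds the raw strings, as they would be read from the CSV
df = pd.DataFrame({
    'ColumnA': [1, 2, 3, 4],
    'ColumnB': ['{"alpha", "bravo"}',
                '{"bravo", "charlie"}',
                '{"alpha", "charlie","gama"}',
                '{"bravo", "charlie","delta"}'],
})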
CodePudding user response:
Assuming you really have string representations of sets, first remove the leading {", the trailing "} and the internal ", " separators, then use str.get_dummies:
# strip the leading {", trailing "} and the internal "," separators (allowing
# optional whitespace around the comma, as in the real data), then one-hot encode
out = (df.drop(columns='ColumnB')
         .join(df['ColumnB']
               .str.replace(r'^{"|"\s*(,)\s*"|"}$', r'\1', regex=True)
               .str.get_dummies(sep=','))
       )
Output:
   ColumnA  alpha  bravo  charlie  delta  gama
0        1      1      1        0      0     0
1        2      0      1        1      0     0
2        3      1      0        1      0     1
3        4      0      1        1      1     0
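Since str.get_dummies only splits on the comma, multi-word or underscored values like the ones mentioned in the question edit come through unchanged. A small sketch with hypothetical values in those formats:
import pandas as pd

s = pd.Series(['{"X Y Z" , "X_Y_Z", "XYZ"}'])   # hypothetical row in the real-world format
s.str.replace(r'^{"|"\s*(,)\s*"|"}$', r'\1', regex=True).str.get_dummies(sep=',')
# -> indicator columns named 'X Y Z', 'X_Y_Z' and 'XYZ'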
Alternative, similar to @PaulS's approach, but with ast.literal_eval:
from ast import literal_eval

# parse each string into an actual Python set, then explode to one row per value
s = df['ColumnB'].apply(literal_eval).explode()
# cross-tabulate the original row index against the values to get indicator columns
out = df.drop(columns='ColumnB').join(pd.crosstab(s.index, s))
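As a quick illustration, literal_eval turns one of these strings (taken from the example data) into a real Python set:
from ast import literal_eval

literal_eval('{"alpha", "charlie","gama"}')
# -> {'alpha', 'charlie', 'gama'} (a set, so order may differ)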
CodePudding user response:
Another possible solution:
# grab every quoted value; extractall returns one row per match with a
# (row, match) MultiIndex, and droplevel(1) keeps only the original row index
aux = df['ColumnB'].str.extractall(r'"(.*?)"').droplevel(1)
pd.concat([df['ColumnA'], pd.crosstab(aux.index, aux[0])], axis=1)
Output:
   ColumnA  alpha  bravo  charlie  delta  gama
0        1      1      1        0      0     0
1        2      0      1        1      0     0
2        3      1      0        1      0     1
3        4      0      1        1      1     0
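A nice property of extractall here is that it captures whatever sits between the double quotes, so values such as "X Y Z" or "X_Y_Z" from the question edit need no special handling. A small sketch with hypothetical values:
import pandas as pd

pd.Series(['{"X Y Z" , "X_Y_Z", "XYZ"}']).str.extractall(r'"(.*?)"')
# -> one row per quoted value: 'X Y Z', 'X_Y_Z', 'XYZ'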