I have a dataframe with columns that contain comma-separated strings. I would like to create new columns, similar to what one-hot encoding does.
Below is a very simplistic example. In my use case, I have thousands of rows and more columns, two of which contain many comma-separated strings. I could have used apply with a lambda and a string-contains condition to create each column, but that is very tedious as it would mean hundreds of new columns.
Input Dataframe
ColumnA ColumnB
1 {"alpha", "bravo"}
2 {"bravo", "charlie"}
3 {"alpha", "charlie","gama"}
4 {"bravo", "charlie","delta"}
Output dataframe
ColumnA alpha bravo charlie delta gama
1 1 1 0 0 0
2 0 1 1 0 0
3 1 0 1 0 1
4 0 1 1 1 0
Edit: I should have mentioned that the data is read from a CSV file. Column B is of type str and contains many strings inside; it looks like {"value1" , "value2", ..., "valueN"}, where the values could be of the format "XYZ", "X Y Z" or "X_Y_Z". The example I have provided is a very simplistic version.
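For reference, here is a minimal sketch that rebuilds the simplistic example above in pandas (ColumnB stored as plain strings, which is also what it would be after pd.read_csv):
import pandas as pd

# ColumnB holds the raw strings, as they would be read from the CSV
df = pd.DataFrame({
    'ColumnA': [1, 2, 3, 4],
    'ColumnB': ['{"alpha", "bravo"}',
                '{"bravo", "charlie"}',
                '{"alpha", "charlie","gama"}',
                '{"bravo", "charlie","delta"}'],
})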
CodePudding user response:
Assuming you really have string representations of sets, first remove the leading {", the trailing "} and the internal ", " separators, then use str.get_dummies:
# strip the leading {", trailing "} and the internal "," separators (allowing
# optional whitespace around the comma, as in the real data), then one-hot encode
out = (df.drop(columns='ColumnB')
         .join(df['ColumnB']
               .str.replace(r'^{"|"\s*(,)\s*"|"}$', r'\1', regex=True)
               .str.get_dummies(sep=','))
       )
Output:
   ColumnA  alpha  bravo  charlie  delta  gama
0        1      1      1        0      0     0
1        2      0      1        1      0     0
2        3      1      0        1      0     1
3        4      0      1        1      1     0
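Since str.get_dummies only splits on the comma, multi-word or underscored values like the ones mentioned in the question edit come through unchanged. A small sketch with hypothetical values in those formats:
import pandas as pd

s = pd.Series(['{"X Y Z" , "X_Y_Z", "XYZ"}'])   # hypothetical row in the real-world format
s.str.replace(r'^{"|"\s*(,)\s*"|"}$', r'\1', regex=True).str.get_dummies(sep=',')
# -> indicator columns named 'X Y Z', 'X_Y_Z' and 'XYZ'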
Alternative, similar to @PaulS's approach, but with ast.literal_eval:
from ast import literal_eval

# parse each string into an actual Python set, then explode to one row per value
s = df['ColumnB'].apply(literal_eval).explode()
# cross-tabulate the original row index against the values to get indicator columns
out = df.drop(columns='ColumnB').join(pd.crosstab(s.index, s))
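As a quick illustration, literal_eval turns one of these strings (taken from the example data) into a real Python set:
from ast import literal_eval

literal_eval('{"alpha", "charlie","gama"}')
# -> {'alpha', 'charlie', 'gama'} (a set, so order may differ)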
CodePudding user response:
Another possible solution:
# grab every quoted value; extractall returns one row per match with a
# (row, match) MultiIndex, and droplevel(1) keeps only the original row index
aux = df['ColumnB'].str.extractall(r'"(.*?)"').droplevel(1)
pd.concat([df['ColumnA'], pd.crosstab(aux.index, aux[0])], axis=1)
Output:
   ColumnA  alpha  bravo  charlie  delta  gama
0        1      1      1        0      0     0
1        2      0      1        1      0     0
2        3      1      0        1      0     1
3        4      0      1        1      1     0
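A nice property of extractall here is that it captures whatever sits between the double quotes, so values such as "X Y Z" or "X_Y_Z" from the question edit need no special handling. A small sketch with hypothetical values:
import pandas as pd

pd.Series(['{"X Y Z" , "X_Y_Z", "XYZ"}']).str.extractall(r'"(.*?)"')
# -> one row per quoted value: 'X Y Z', 'X_Y_Z', 'XYZ'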