Splitting a string column according to its values creating new columns-CodePudding

I have a dataframe which looks somthing like this:

df
name     age      symptoms
x        18       fever, headache, cough
y        23       cough
z        89       HTN
a        43       TNT, HTN

I want to create a new dataframe which looks something like this, depending on the symptoms column.

df2
name     age      symptoms                     fever      headache      cough    HTN   TNT
x        18       fever, headache, cough        1          1             1       0      0
y        23       cough                         0          0             1       0      0
z        89       HTN                           0          0             0       1      0
a        43       TNT, HTN                      0          0             0       1      1

So it actually creates a new column for every new value found.

I tried it using:

split_ = df["symptoms"].str.split(expand=True)

but how do I create new columns and add the 0 and 1 value in the dataframe.

CodePudding user response：

Try pd.DataFrame.join with str.get_dummies with sep:

df.join(df['symptoms'].str.get_dummies(sep=', '))

Output:

  name  age                symptoms  HTN  TNT  cough  fever  headache
0    x   18  fever, headache, cough    0    0      1      1         1
1    y   23                   cough    0    0      1      0         0
2    z   89                     HTN    1    0      0      0         0
3    a   43                TNT, HTN    1    1      0      0         0

pd.Series.str.get_dummies has the sep argument, where you could specify the separator you want for get_dummies.

CodePudding user response：

You could use a combination of pandas.get_dummies and join:

df.join(pd.get_dummies(df['symptoms'].str.split(', ').explode()).groupby(level=0).max())

output:

  name  age                symptoms  HTN  TNT  cough  fever  headache
0    x   18  fever, headache, cough    0    0      1      1         1
1    y   23                   cough    0    0      1      0         0
2    z   89                     HTN    1    0      0      0         0
3    a   43                TNT, HTN    1    1      0      0         0

CodePudding user response：

df.join(pd.get_dummies(df['symptoms']))

   name  age                symptoms  HTN  TNT, HTN  cough  \
0    x   18  fever, headache, cough    0         0      0   
1    y   23                   cough    0         0      1   
2    z   89                     HTN    1         0      0   
3    a   43                TNT, HTN    0         1      0   

   fever, headache, cough  
0                       1  
1                       0  
2                       0  
3                       0

You could also crosstab

pd.crosstab([df.name, df.age], df.symptoms).reset_index()

symptoms name  age  HTN  TNT, HTN  cough  fever, headache, cough
0           a   43    0         1      0                       0
1           x   18    0         0      0                       1
2           y   23    0         0      1                       0
3           z   89    1         0      0                       0