Home > Software engineering >  Convert a categorical valued column to its statistical values in Python
Convert a categorical valued column to its statistical values in Python

Time:09-23

I have a dataframe whose sample is given below.

import pandas as pd

data = {'ID':['A','B','C','D','E','F'],
'Gender':['Man', 'Woman', 'Transgender', 'Non-binary,Transgender', 'Woman,Non-binary', 
'Man,Non-binary,Transgender']}

df = pd.DataFrame(data)
df

enter image description here

Now, I want to create a column for each value in the 'Gender' column and if the value is present in the row, the new column should have '1' else empty. The final form required is shown below.

enter image description here

Cannot use pd.get_dummies() as there are multiple values(ex: 'Non-binary, Transgender') in many rows. I thought of manually hardcoding for all values, but wanted to know if there is a way to automate the process. Any help is greatly appreciated. Thanks.

CodePudding user response:

Use Series.str.get_dummies, which allows you to specify a separator in the case of multiple values in a string, then join the result back.

pd.concat([df, df['Gender'].str.get_dummies(',').add_prefix('Gender_')], axis=1)

  ID                      Gender  Gender_Man  Gender_Non-binary  Gender_Transgender  Gender_Woman
0  A                         Man           1                  0                   0             0
1  B                       Woman           0                  0                   0             1
2  C                 Transgender           0                  0                   1             0
3  D      Non-binary,Transgender           0                  1                   1             0
4  E            Woman,Non-binary           0                  1                   0             1
5  F  Man,Non-binary,Transgender           1                  1                   1             0

CodePudding user response:

Well you can split on , to easily come back to a situation where you can use get_dummies:

>>> df_split = df[['ID']].join(df['Gender'].str.split(',')).explode('Gender')
>>> df_split
  ID       Gender
0  A          Man
1  B        Woman
2  C  Transgender
3  D   Non-binary
3  D  Transgender
4  E        Woman
4  E   Non-binary
5  F          Man
5  F   Non-binary
5  F  Transgender
>>> dummies = pd.get_dummies(df_split['Gender']).groupby(df_split['ID']).max().reset_index()
>>> dummies
  ID  Man  Non-binary  Transgender  Woman
0  A    1           0            0      0
1  B    0           0            0      1
2  C    0           0            1      0
3  D    0           1            1      0
4  E    0           1            0      1
5  F    1           1            1      0 
>>> df.merge(dummies, on='ID')
  ID                      Gender  Man  Non-binary  Transgender  Woman
0  A                         Man    1           0            0      0
1  B                       Woman    0           0            0      1
2  C                 Transgender    0           0            1      0
3  D      Non-binary,Transgender    0           1            1      0
4  E            Woman,Non-binary    0           1            0      1
5  F  Man,Non-binary,Transgender    1           1            1      0
  • Related