Python: count strings in a DataFrame column that belong to a list

I spent a day trying to solve this problem.

I have a DataFrame that I import from a CSV file. Here is an example:

import pandas as pd

df = pd.DataFrame(['{"choices": ["rougeur", "hematome","oedeme","ecoul","necrose"]}', 'ecoul', 'necrose', '', 'oedeme'])

I have a list of the possible labels:

label_sl = ['rougeur', 'hematome', 'oedeme', 'ecoul', 'extra', 'necrose']

I would like to create a new DataFrame with one column per label and one row per input row, where 1 means the label appears in that row:

rougeur  hematome  oedeme  ecoul  extra  necrose
      1         1       1      1      0        1
      0         0       0      1      0        0
      0         0       0      0      0        1
      0         0       0      0      0        0
      0         0       1      0      0        0

I can't find a solution. If you have an idea, I would appreciate it.

Thanks,

CodePudding user response：

If all your values, including the dictionary-like one, are actually strings, this should work:

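(df[0].str.replace(r'[\[\]{}"]', '', regex=True)  # drop brackets, braces and quotes
      .str.strip()
      .str.split('[, ]')            # split on a comma or a space
      .explode()                    # one token per row, original index repeated
      .str.get_dummies()            # one indicator column per distinct token
      .groupby(level=0).sum()       # recombine the tokens of each original row
      .reindex(label_sl, axis=1)    # keep only the wanted labels, in order
      .fillna(0)                    # labels that never occur become 0
      .astype(int))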

Output:

   rougeur  hematome  oedeme  ecoul  extra  necrose
0        1         1       1      1      0        1
1        0         0       0      1      0        0
2        0         0       0      0      0        1
3        0         0       0      0      0        0
4        0         0       1      0      0        0
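
To see what each step of the chain contributes, the same pipeline can be unrolled into named intermediate results. This is only a sketch of the answer above using the data from the question; the variable names (cleaned, tokens, exploded, counts, result) are made up for illustration.

```python
import pandas as pd

label_sl = ['rougeur', 'hematome', 'oedeme', 'ecoul', 'extra', 'necrose']
df = pd.DataFrame(['{"choices": ["rougeur", "hematome","oedeme","ecoul","necrose"]}',
                   'ecoul', 'necrose', '', 'oedeme'])

# 1. Drop brackets, braces and quotes so only words, commas and spaces remain.
cleaned = df[0].str.replace(r'[\[\]{}"]', '', regex=True).str.strip()

# 2. Split on a comma or a space; a multi-character pattern is treated as a regex.
tokens = cleaned.str.split('[, ]')

# 3. One token per row; the original row index is repeated for each token.
exploded = tokens.explode()

# 4. One indicator column per distinct token, summed back per original row.
counts = exploded.str.get_dummies().groupby(level=0).sum()

# 5. Keep only the wanted labels, in order; labels never seen ('extra') become 0.
result = counts.reindex(label_sl, axis=1).fillna(0).astype(int)
print(result)
```

Tokens that are not in label_sl, such as the leftover "choices:" key, are dropped by the reindex step.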

CodePudding user response：

The regular expression \bsomething\b matches "something" only when it appears as a whole word. We can use it like this:

for x in label_sl:
    # 1 if the label occurs as a whole word in the first column, else 0
    df[x] = df.iloc[:, 0].str.contains("\\b" + x + "\\b").astype(int)

where

label_sl = ['rougeur', 'hematome', 'oedeme', 'ecoul', 'extra', 'necrose']
df = pd.DataFrame(['{"choices": ["rougeur", "hematome","oedeme","ecoul","necrose"]}', 'ecoul', 'necrose', '', 'oedeme'])
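
Since the question asks for a new DataFrame rather than extra columns on df, the same word-boundary idea can also be sketched as below. The use of re.escape is an addition that is not in the answer above; it is only there in case a label ever contains regex metacharacters, and the name result is made up for illustration.

```python
import re
import pandas as pd

label_sl = ['rougeur', 'hematome', 'oedeme', 'ecoul', 'extra', 'necrose']
df = pd.DataFrame(['{"choices": ["rougeur", "hematome","oedeme","ecoul","necrose"]}',
                   'ecoul', 'necrose', '', 'oedeme'])

# Build a separate frame with one 0/1 column per label, in label_sl order.
# \b...\b makes the label match only as a whole word.
result = pd.DataFrame({
    label: df[0].str.contains(r"\b" + re.escape(label) + r"\b").astype(int)
    for label in label_sl
})
print(result)
```

This leaves df itself unchanged, so the original column does not end up in the output.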