I spent a day trying to solve this problem.
I have a DataFrame that I import from a CSV file. Here is an example:
import pandas as pd
df = pd.DataFrame(['{"choices": ["rougeur", "hematome","oedeme","ecoul","necrose"]}', 'ecoul', 'necrose', '', 'oedeme'])
I also have a list of the possible labels:
label_sl=['rougeur', 'hematome', 'oedeme','ecoul','extra','necrose']
I would like to create a new DataFrame with one 0/1 column per label, like this:
rougeur hematome oedeme ecoul extra necrose
1 1 1 1 0 1
0 0 0 1 0 0
0 0 0 0 0 1
0 0 0 0 0 0
0 0 1 0 0 0
I can't find a solution. If you have any ideas, I'd appreciate them.
Thanks,
AL
CodePudding user response:
If all your values, including the dictionary-like one, are actually strings, this should work:
(df[0].str.replace(r'[\[\]{}"]', '', regex=True)
      .str.strip()
      .str.split('[, ]')
      .explode()
      .str.get_dummies()
      .groupby(level=0).sum()
      .reindex(label_sl, axis=1)
      .fillna(0)
      .astype(int))
Output:
   rougeur  hematome  oedeme  ecoul  extra  necrose
0        1         1       1      1      0        1
1        0         0       0      1      0        0
2        0         0       0      0      0        1
3        0         0       0      0      0        0
4        0         0       1      0      0        0
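To see what each step of the chain contributes, here is a self-contained sketch of the same pipeline with the intermediates given names. The names cleaned, tokens, dummies and result are mine, chosen for illustration; the pandas calls are the ones used in the answer above.

```python
import pandas as pd

df = pd.DataFrame(['{"choices": ["rougeur", "hematome","oedeme","ecoul","necrose"]}',
                   'ecoul', 'necrose', '', 'oedeme'])
label_sl = ['rougeur', 'hematome', 'oedeme', 'ecoul', 'extra', 'necrose']

# 1. Drop brackets, braces and quotes so only words, commas and spaces remain.
cleaned = df[0].str.replace(r'[\[\]{}"]', '', regex=True).str.strip()

# 2. Split on commas or spaces, then give every token its own row
#    (the original row index is repeated for each token).
tokens = cleaned.str.split('[, ]').explode()

# 3. One indicator column per distinct token, summed back per original row.
dummies = tokens.str.get_dummies().groupby(level=0).sum()

# 4. Keep only the known labels, in the requested order; labels that never
#    occur (here 'extra') come back as NaN and are filled with 0.
result = dummies.reindex(label_sl, axis=1).fillna(0).astype(int)
print(result)
```

Stray tokens such as "choices:" get their own column in step 3 and are dropped by the reindex in step 4. If a label could appear twice in the same row, the sum would give 2 rather than 1, so you may want .clip(upper=1) in that case.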
CodePudding user response:
The regular expression \bsomething\b matches something only when it appears as a separate word. We can use it like this:
for x in label_sl:
    # \b on each side makes the label match only as a whole word
    df[x] = df.iloc[:, 0].str.contains("\\b" + x + "\\b").astype(int)
where
label_sl=['rougeur', 'hematome', 'oedeme','ecoul','extra','necrose']
df=pd.DataFrame(['{"choices": ["rougeur", "hematome","oedeme","ecoul","necrose"]}','ecoul','necrose','','oedeme'])
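Since the question asks for a new DataFrame, a variant of the word-boundary idea that leaves df untouched might look like the sketch below. The name flags and the use of re.escape are my additions, not part of the answer; re.escape only matters if a label ever contains regex metacharacters.

```python
import re
import pandas as pd

df = pd.DataFrame(['{"choices": ["rougeur", "hematome","oedeme","ecoul","necrose"]}',
                   'ecoul', 'necrose', '', 'oedeme'])
label_sl = ['rougeur', 'hematome', 'oedeme', 'ecoul', 'extra', 'necrose']

# Build one 0/1 column per label in a separate frame.
# \b anchors the match to word boundaries, so a label is only found
# when it stands alone as a word, not as part of a longer one.
flags = pd.DataFrame({
    label: df[0].str.contains(r"\b" + re.escape(label) + r"\b", regex=True).astype(int)
    for label in label_sl
})
print(flags)
```

This does one pass over the column per label, which is fine for a short label list; for many labels the split-and-get_dummies approach is likely to scale better.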