Say we have this df:
import pandas as pd
df = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color']})
                                  a
0  hair color other family, friends
1        family, friends hair color
I want to extract strings using my own list of items:
items = ['hair color', 'other', 'family, friends']
I want to do this because there is no consistent delimiter or pattern in the raw data.
Desired output:
import numpy as np
desired_output = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color'],
                               'hair color': ['hair color', 'hair color'],
                               'other': ['other', np.nan],
                               'family, friends': ['family, friends', 'family, friends']
                               })
                                  a  hair color  other  family, friends
0  hair color other family, friends  hair color  other  family, friends
1        family, friends hair color  hair color    NaN  family, friends
CodePudding user response:
You can craft a regex to use with str.extractall:
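import re
# Build one capture group per item, escaping any regex metacharacters
regex = '|'.join([f'({re.escape(i)})' for i in items])
# '(hair\\ color)|(other)|(family,\\ friends)'
# extractall returns one row per match; label the groups with the item
# names, then keep the first non-null value per original row
df.join(df['a'].str.extractall(regex)
               .set_axis(items, axis=1)
               .groupby(level=0).first())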
output:
                                  a  hair color  other  family, friends
0  hair color other family, friends  hair color  other  family, friends
1        family, friends hair color  hair color   None  family, friends
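To see why the groupby(level=0).first() step is needed, here is a minimal sketch using only the standard re module (no pandas). It assumes the same items list as in the question; the names rows and collapsed are just illustrative. Each match fills exactly one capture group, so the per-match rows have to be collapsed into one value per item:

```python
import re

items = ['hair color', 'other', 'family, friends']
# One capture group per item, in the same order as `items`
regex = '|'.join(f'({re.escape(i)})' for i in items)

text = 'family, friends hair color'
# Each match fills exactly one group; the other groups are None
rows = [m.groups() for m in re.finditer(regex, text)]

# Take the first non-None value per group, which roughly mirrors
# what groupby(level=0).first() does in the pandas version
collapsed = [next((r[k] for r in rows if r[k] is not None), None)
             for k in range(len(items))]

print(rows)
print(collapsed)
```

In the pandas version, extractall produces the equivalent of rows (indexed by original row and match number), and first() produces the equivalent of collapsed.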
update, adding a prefix to the new columns and converting None to NaN:
df.join(df['a'].str.extractall(regex)
               .set_axis(items, axis=1)
               .groupby(level=0).first()
               .add_prefix('item1_')
               .replace({None: np.nan})
        )
output:
                                  a  item1_hair color  item1_other  item1_family, friends
0  hair color other family, friends        hair color        other        family, friends
1        family, friends hair color        hair color          NaN        family, friends
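One caveat, in case your real list has items that overlap: regex alternation tries the alternatives left to right, so a shorter item listed first can shadow a longer one. The items in the question do not overlap, so this is only a sketch with a hypothetical list; sorting by length (longest first) is one possible workaround, and the same sorted order should then be used for the column labels:

```python
import re

# Hypothetical overlapping items, not taken from the question
items = ['hair', 'hair color']
text = 'hair color other'

naive = '|'.join(f'({re.escape(i)})' for i in items)
# 'hair' is tried first and wins, so 'hair color' is never matched
naive_hits = [m.group(0) for m in re.finditer(naive, text)]

# Put longer items first so they get a chance to match
ordered = sorted(items, key=len, reverse=True)
safer = '|'.join(f'({re.escape(i)})' for i in ordered)
safer_hits = [m.group(0) for m in re.finditer(safer, text)]

print(naive_hits)
print(safer_hits)
```

If you use this with the answer above, pass ordered (not items) to set_axis so the column names still line up with the capture groups.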