I have list procedure_name = [brain, spine, abdomen, thorax]
actual list is longer.
I have a df where one of the columns is ['procedure'] has free text values (as its entered by human operator)
procedure |
---|
brain and spine |
BRAIN SPINE |
Thorax limited cuts |
Abdomen contrast |
Thorax Abdomen |
and so on..
I want to check if each value in the column has one or more matching terms from the list(procedure_name) and add those terms to new columns ['Anatomy1']['Anatomy2'] ['Anatomy3'] for a max of 3 matches.
Expected output
procedure | Anatomy1 | Anatomy2 | Anatomy3 |
---|---|---|---|
brain and spine | brain | spine | none |
BRAIN SPINE | brain | spine | none |
Thorax limited cuts | thorax | none | none |
Abdomen contrast | abdomen | none | none |
Thorax Abdomen | thorax | abdomen | none |
I tried my best to explain the problem as logically as possible Thanks in advance
CodePudding user response:
You could use str.extractall
and a custom regex, then unstack
and join
:
import re
MAX = 3
procedure_name = ['brain', 'spine', 'abdomen', 'thorax']
regex = '|'.join(procedure_name)
df2 = df.join(df['procedure']
.str.lower() # optional, to ensure output is lowercase
.str.extractall(f'({regex})', flags=re.I)[0]
.unstack('match')
.reindex(columns=range(MAX)) # optional (see below)
# alternative to reindex if you
# don't want to create empty columns
#.iloc[:, :MAX]
.rename(columns=lambda x: f'Anatomy{x 1}')
)
Output:
procedure Anatomy1 Anatomy2 Anatomy3
0 brain and spine brain spine NaN
1 BRAIN SPINE brain spine NaN
2 Thorax limited cuts thorax NaN NaN
3 Abdomen contrast abdomen NaN NaN
4 Thorax Abdomen thorax abdomen NaN
CodePudding user response:
Here's one way using str.get_dummies
def get_values(x):
v = x.dropna().to_numpy().tolist()
if len(v) >=3:
return v[:3]
else:
return v [float('nan')] * (3 - len(v))
tmp = df['procedure'].str.replace(' ', ' ', regex=False).str.lower().str.get_dummies(' ')[procedure_name]
out = (pd.DataFrame(tmp.mul(tmp.columns).replace('', pd.NA)
.apply(get_values, axis=1).tolist(),
columns=['Anatomy1', 'Anatomy2', 'Anatomy3']))
Output:
Anatomy1 Anatomy2 Anatomy3
0 brain spine NaN
1 brain spine NaN
2 thorax NaN NaN
3 abdomen NaN NaN
4 abdomen thorax NaN
CodePudding user response:
Check with findall
import re
df = df.join(df.procedure.str.findall('|'.join(procedure_name),flags=re.IGNORECASE).apply(pd.Series).reindex(columns = [0,1,2]))