Home > Net >  checking if a string subset is in column values from a list and assign value in a new column in pand
checking if a string subset is in column values from a list and assign value in a new column in pand

Time:04-01

I have list procedure_name = [brain, spine, abdomen, thorax]

actual list is longer.

I have a df where one of the columns is ['procedure'] has free text values (as its entered by human operator)

procedure
brain and spine
BRAIN SPINE
Thorax limited cuts
Abdomen contrast
Thorax Abdomen

and so on..

I want to check if each value in the column has one or more matching terms from the list(procedure_name) and add those terms to new columns ['Anatomy1']['Anatomy2'] ['Anatomy3'] for a max of 3 matches.

Expected output

procedure Anatomy1 Anatomy2 Anatomy3
brain and spine brain spine none
BRAIN SPINE brain spine none
Thorax limited cuts thorax none none
Abdomen contrast abdomen none none
Thorax Abdomen thorax abdomen none

I tried my best to explain the problem as logically as possible Thanks in advance

CodePudding user response:

You could use str.extractall and a custom regex, then unstack and join:

import re
MAX = 3
procedure_name = ['brain', 'spine', 'abdomen', 'thorax']
regex = '|'.join(procedure_name)

df2 = df.join(df['procedure']
 .str.lower() # optional, to ensure output is lowercase
 .str.extractall(f'({regex})', flags=re.I)[0]
 .unstack('match')
 .reindex(columns=range(MAX)) # optional (see below)
 # alternative to reindex if you
 # don't want to create empty columns
 #.iloc[:, :MAX]
 .rename(columns=lambda x: f'Anatomy{x 1}')
 )

Output:

             procedure Anatomy1 Anatomy2  Anatomy3
0      brain and spine    brain    spine       NaN
1          BRAIN SPINE    brain    spine       NaN
2  Thorax limited cuts   thorax      NaN       NaN
3     Abdomen contrast  abdomen      NaN       NaN
4     Thorax   Abdomen   thorax  abdomen       NaN

CodePudding user response:

Here's one way using str.get_dummies

def get_values(x):
    v = x.dropna().to_numpy().tolist()
    if len(v) >=3:
        return v[:3]
    else:
        return v   [float('nan')] * (3 - len(v))

tmp = df['procedure'].str.replace(' ', ' ', regex=False).str.lower().str.get_dummies(' ')[procedure_name]
out = (pd.DataFrame(tmp.mul(tmp.columns).replace('', pd.NA)
                    .apply(get_values, axis=1).tolist(), 
                    columns=['Anatomy1', 'Anatomy2', 'Anatomy3']))

Output:

  Anatomy1 Anatomy2  Anatomy3
0    brain    spine       NaN
1    brain    spine       NaN
2   thorax      NaN       NaN
3  abdomen      NaN       NaN
4  abdomen   thorax       NaN

CodePudding user response:

Check with findall

import re
df = df.join(df.procedure.str.findall('|'.join(procedure_name),flags=re.IGNORECASE).apply(pd.Series).reindex(columns = [0,1,2]))
  • Related