Pandas extract substring from column of string-CodePudding

I have the following dataframes

df1:
Name
0   AAa
1   BBB
2   ccc

df2:
Description
0   text AAa clinic text
1   text bbb hospital text

I want to add another column to df2 that extracts Name from Description, but the Name has to be followed by either 'clinic' or 'hospital'. So if Description is "text AAa text", I don't want AAa to be extracted

I feel like this should be straightforward but for some reason I am stuck and can't find a solution

I have tried the following but it returns df2['Extracted Name'] all None

def df_matcher(x):
    for i in df1['Name']:
        if ((i.lower()   " clinic" in x.lower()) or (i.lower()   " hospital" in x.lower())):
            return i

df2['Extracted Name'] = df2['Description'].apply(df_matcher)

Thanks!

CodePudding user response：

Let's use str.extract with a regex pattern:

pat = r'(?i)\b(%s)\s(?:clinic|hospital)' % '|'.join(df1.Name)
df2['col'] = df2['Description'].str.extract(pat)

              Description  col
0    text AAa clinic text  AAa
1  text bbb hospital text  bbb

CodePudding user response：

If the part you want to remove appears exactly before clinic or hospital, you don't need the first dataframe and you can try this:

df2['split'] = df2.Description.str.split()
df2['index'] = df2['split'].apply(lambda x:x.index('clinic') if 'clinic' in x else x.index('hospital'))

df2['remove'] = df2.apply(lambda x:x['spilit'][x['index'] - 1], axis=1)
df2['Extracted Name'] = df2.apply(lambda x:x['Description'].replace(x['remove'], ''), axis=1)