Home > Net >  Substring Merge on a column with special characters
Substring Merge on a column with special characters

Time:12-31

I want to merge two dfs over 'Name' columns. However, one of them can be substring or exactly equal to other column. For this, I have used

df1 = pd.DataFrame(
    {
        'Name': ['12,5mg/0,333ml(37,5mg/ml)', 'ad', 'aaa'],
    }
)

df2 = pd.DataFrame(
    {
        'Name': ['12,5mg/0,333ml(37,5mg/ml)', 'ad', 'aaaa'],
    }
)

str_match = "({})".format("|".join(df1.Name))
df2.merge(df1, 
          left_on=df2.Name.str.extract(str_match)[0], 
          right_on="Name", how='outer')

'ad', 'aaa', 'aaaa' values are merged correctly. However, the problem occurs on the value '12,5mg/0,333ml(37,5mg/ml)' (most probably because of special characters). What I got with this code snippet

I have tried re.espace() but it did not help me. How can I solve this problem?

CodePudding user response:

You need to escape any special character. I would also keep only distinct values of the escaped strings (to simplify the regex if there are many repeats) and compile the pattern beforehand:

pattern = re.compile('({})'.format(
    '|'.join(pd.unique(df1.Name.apply(re.escape)))
))
out = df2.merge(
    df1, 
    left_on=df2.Name.str.extract(pattern)[0],
    right_on="Name", how='outer')

>>> out
  Name                       Name_x                     Name_y                    
0  12,5mg/0,333ml(37,5mg/ml)  12,5mg/0,333ml(37,5mg/ml)  12,5mg/0,333ml(37,5mg/ml)
1                         ad                         ad                         ad
2                        aaa                       aaaa                        aaa

CodePudding user response:

I think what you want to do is:

str_match = "({})".format("|".join(df1.Name))
df1.merge(df2['Name'], how='outer',
          left_on=df1['Name'].str.extract(str_match, expand=False)[0],
          right_on=df2['Name'].str.extract(str_match, expand=False)[0],
         ).drop(columns='key_0')

which gives:

                      Name_x                     Name_y
0  12,5mg/0,333ml(37,5mg/ml)  12,5mg/0,333ml(37,5mg/ml)
1                         ad                         ad
2                        aaa                       aaaa
  • Related