I want to merge two dfs over 'Name' columns. However, one of them can be substring or exactly equal to other column. For this, I have used
df1 = pd.DataFrame(
{
'Name': ['12,5mg/0,333ml(37,5mg/ml)', 'ad', 'aaa'],
}
)
df2 = pd.DataFrame(
{
'Name': ['12,5mg/0,333ml(37,5mg/ml)', 'ad', 'aaaa'],
}
)
str_match = "({})".format("|".join(df1.Name))
df2.merge(df1,
left_on=df2.Name.str.extract(str_match)[0],
right_on="Name", how='outer')
'ad', 'aaa', 'aaaa' values are merged correctly. However, the problem occurs on the value '12,5mg/0,333ml(37,5mg/ml)' (most probably because of special characters). What I got with this code snippet
I have tried re.espace() but it did not help me. How can I solve this problem?
CodePudding user response:
You need to escape any special character. I would also keep only distinct values of the escaped strings (to simplify the regex if there are many repeats) and compile the pattern beforehand:
pattern = re.compile('({})'.format(
'|'.join(pd.unique(df1.Name.apply(re.escape)))
))
out = df2.merge(
df1,
left_on=df2.Name.str.extract(pattern)[0],
right_on="Name", how='outer')
>>> out
Name Name_x Name_y
0 12,5mg/0,333ml(37,5mg/ml) 12,5mg/0,333ml(37,5mg/ml) 12,5mg/0,333ml(37,5mg/ml)
1 ad ad ad
2 aaa aaaa aaa
CodePudding user response:
I think what you want to do is:
str_match = "({})".format("|".join(df1.Name))
df1.merge(df2['Name'], how='outer',
left_on=df1['Name'].str.extract(str_match, expand=False)[0],
right_on=df2['Name'].str.extract(str_match, expand=False)[0],
).drop(columns='key_0')
which gives:
Name_x Name_y
0 12,5mg/0,333ml(37,5mg/ml) 12,5mg/0,333ml(37,5mg/ml)
1 ad ad
2 aaa aaaa