I have the following two tables df1
:
id | description |
---|---|
1 | Foo |
2 | Bar |
and df2
:
description | category |
---|---|
Foo | cat1 |
Barrista | cat2 |
I now want to merge the dfs based on the "description" column if the string in df2
contains the string of df1
.
Right now I could only make it work with exact matches but not for contains
cases:
df3 = df1.merge(df2, on='description', how='left', suffixes=('_1', '_2'))
which returns
id | description | category |
---|---|---|
1 | Foo | cat1 |
2 | Bar |
but the desired output df3
should look like:
id | description | category |
---|---|---|
1 | Foo | cat1 |
2 | Bar | cat2 |
CodePudding user response:
You can craft a regex to extract the patterns in df1, then merge
:
import re
s = df1['description'].sort_values(key=lambda s: s.str.len(), ascending=False)
regex = '|'.join(map(re.escape, s))
# 'Foo|Bar'
df3 = df1.merge(df2, left_on='description',
right_on=df2['description'].str.extract(fr'({regex})',
expand=False),
how='left', suffixes=('_1', '_2'))
NB. I am sorting the words of df1 by decreasing length to ensure that a longer match would be used in case of multiple matches.
output:
description id description_1 description_2 category
0 Foo 1 Foo Foo cat1
1 Bar 2 Bar Barrista cat2