Home > front end >  How would I merge two dataframes using a "contains" condition?
How would I merge two dataframes using a "contains" condition?

Time:09-21

I have the following two tables df1:

id description
1 Foo
2 Bar

and df2:

description category
Foo cat1
Barrista cat2

I now want to merge the dfs based on the "description" column if the string in df2 contains the string of df1.

Right now I could only make it work with exact matches but not for contains cases:

df3 = df1.merge(df2, on='description', how='left', suffixes=('_1', '_2'))

which returns

id description category
1 Foo cat1
2 Bar

but the desired output df3 should look like:

id description category
1 Foo cat1
2 Bar cat2

CodePudding user response:

You can craft a regex to extract the patterns in df1, then merge:

import re
s = df1['description'].sort_values(key=lambda s: s.str.len(), ascending=False)
regex = '|'.join(map(re.escape, s))
# 'Foo|Bar'

df3 = df1.merge(df2, left_on='description',
                right_on=df2['description'].str.extract(fr'({regex})',
                                                        expand=False),
                how='left', suffixes=('_1', '_2'))

NB. I am sorting the words of df1 by decreasing length to ensure that a longer match would be used in case of multiple matches.

output:

  description  id description_1 description_2 category
0         Foo   1           Foo           Foo     cat1
1         Bar   2           Bar      Barrista     cat2
  • Related