Sorry if this is a duplicate question; I tried to find one that answers my specific use case, but I only found ones that deal with exact matches between dataframes.
I have two pandas DataFrames of descriptions:
df1:
Description
i had lunch
going to the airport
buying a suitcase
df2:
Description
buying lunch
airport travel
owning a car
I'd like to filter df2 and/or count how many rows of df2 contain a word that appears in any row of df1. For example, df2 has the words 'lunch' and 'airport', and those single words appear in df1, so I would like to pull out and count the rows of df2 that have such a match. So my output is just df2 filtered based on single-word matches with df1.
The filtered example output would be:
df2:
Description
buying lunch
airport travel
Is there a way I can do this with pandas dataframes?
Example input data:
import pandas as pd

d = {'Description': ['i had lunch', 'going to the airport', 'buying a suitcase']}
df1 = pd.DataFrame(data=d)
d = {'Description': ['buying lunch', 'airport travel', 'owning a car']}
df2 = pd.DataFrame(data=d)
CodePudding user response:
globally
To match globally (any row of df1 with any row of df2), you can use:
import re
# build an alternation pattern of every unique word found in df1
regex = '|'.join(map(re.escape, df1['Description'].str.split().explode().unique()))
# keep the rows of df2 whose Description contains any of those words as a whole word
out = df2[df2['Description'].str.contains(fr'\b({regex})\b')]
To count:
df2['Description'].str.contains(fr'\b({regex})\b').sum()
Output: 3 (with this sample data all three rows match, because the single word 'a' from df1 also occurs in 'owning a car'; see excluding small words below).
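Note that this matching is case-sensitive. If your real descriptions have mixed case (an assumption; the sample data here is all lowercase), str.contains takes a case flag:
out = df2[df2['Description'].str.contains(fr'\b({regex})\b', case=False)]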
per row
If you want to match each pair of rows (row i of df1 against row i of df2):
matches = [bool(set(a.split()) & set(b.split()))
           for a, b in zip(df1['Description'].str.lower(),
                           df2['Description'].str.lower())]
Output: [True, True, True]
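If you want the filtered df2 itself rather than the booleans, the boolean list can index the frame directly (and the same works for the blacklist variant below):
result = df2[matches]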
To count:
count = sum(bool(set(a.split()) & set(b.split()))
            for a, b in zip(df1['Description'].str.lower(),
                            df2['Description'].str.lower()))
Output: 3
excluding small words
blacklist = {'a'}
matches = [bool(set(a.split()) & set(b.split()) - blacklist)
           for a, b in zip(df1['Description'].str.lower(),
                           df2['Description'].str.lower())]
Output: [True, True, False]
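The same blacklist can also be plugged into the global approach above; a minimal sketch, assuming a blacklist of just {'a'} is enough for this data:
import re

blacklist = {'a'}
words = set(df1['Description'].str.split().explode()) - blacklist
regex = '|'.join(map(re.escape, words))
out = df2[df2['Description'].str.contains(fr'\b({regex})\b')]
Output (the filtered df2 the question asks for):
      Description
0    buying lunch
1  airport travel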
CodePudding user response:
So I think you should look into the pandas str.split method.
I'm not too sure exactly how to implement this, but in reference to this post, something along the lines of this might work:
words = df1['Description'].str.split().apply(pd.Series).stack().reset_index(drop=True).tolist()
result = df2.loc[df2['Description'].str.split().explode().isin(words).groupby(level=0).any()]
Basically, you split the strings of df1 into single words and convert them into a list. Then you can check whether any word of a df2 string appears in that list with the .isin() method (applied here to the exploded words of df2).
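For completeness, a rough self-contained version of the same idea (the one-letter-word filter is an extra assumption, so that 'owning a car' is not kept just because of 'a'):
import pandas as pd

df1 = pd.DataFrame({'Description': ['i had lunch', 'going to the airport', 'buying a suitcase']})
df2 = pd.DataFrame({'Description': ['buying lunch', 'airport travel', 'owning a car']})

# words from df1, skipping one-letter words like 'i' and 'a'
words = [w for w in df1['Description'].str.split().explode().unique() if len(w) > 1]
# a df2 row matches if any of its words is in that list
mask = df2['Description'].str.split().explode().isin(words).groupby(level=0).any()
result = df2.loc[mask]
print(result)  # keeps 'buying lunch' and 'airport travel'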
hope this helps :)