Sorry if this is a duplicate question; I tried to find one that answers my specific use case, but I only found ones that deal with exact matches between dataframes.
I have two pandas DataFrames of descriptions:
df1:
Description
i had lunch
going to the airport
buying a suitcase
df2:
Description
buying lunch
airport travel
owning a car
I'd like to filter df2 and/or count how many rows of df2 contain a word that appears in any row of df1. For example, df2 has the words 'lunch' and 'airport', and those single words appear in df1, so I would like to pull out and count the rows of df2 that have such a match. So my output is just df2 filtered based on single-word matches with df1.
The filtered example output would be:
df2:
Description
buying lunch
airport travel
Is there a way I can do this with pandas dataframes?
Example input data:
import pandas as pd

d = {'Description': ['i had lunch', 'going to the airport', 'buying a suitcase']}
df1 = pd.DataFrame(data=d)
d = {'Description': ['buying lunch', 'airport travel', 'owning a car']}
df2 = pd.DataFrame(data=d)
CodePudding user response:
globally
To match globally (any row of df1 with any row of df2), you can use:
import re
# build an alternation pattern of every unique word found in df1
regex = '|'.join(map(re.escape, df1['Description'].str.split().explode().unique()))
# keep the rows of df2 whose Description contains any of those words as a whole word
out = df2[df2['Description'].str.contains(fr'\b({regex})\b')]
To count:
df2['Description'].str.contains(fr'\b({regex})\b').sum()
Output: 3 (with this sample data all three rows match, because the single word 'a' from df1 also occurs in 'owning a car'; see excluding small words below).
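Note that this matching is case-sensitive. If your real descriptions have mixed case (an assumption; the sample data here is all lowercase), str.contains takes a case flag:
out = df2[df2['Description'].str.contains(fr'\b({regex})\b', case=False)]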
per row
If you want to match each pair of rows (row i of df1 against row i of df2):
matches = [bool(set(a.split()) & set(b.split()))
           for a, b in zip(df1['Description'].str.lower(),
                           df2['Description'].str.lower())]
Output: [True, True, True]
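If you want the filtered df2 itself rather than the booleans, the boolean list can index the frame directly (and the same works for the blacklist variant below):
result = df2[matches]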
To count:
count = sum(bool(set(a.split()) & set(b.split()))
            for a, b in zip(df1['Description'].str.lower(),
                            df2['Description'].str.lower()))
Output: 3
excluding small words
blacklist = {'a'}
matches = [bool(set(a.split()) & set(b.split()) - blacklist)
           for a, b in zip(df1['Description'].str.lower(),
                           df2['Description'].str.lower())]
Output: [True, True, False]
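The same blacklist can also be plugged into the global approach above; a minimal sketch, assuming a blacklist of just {'a'} is enough for this data:
import re

blacklist = {'a'}
words = set(df1['Description'].str.split().explode()) - blacklist
regex = '|'.join(map(re.escape, words))
out = df2[df2['Description'].str.contains(fr'\b({regex})\b')]
Output (the filtered df2 the question asks for):
      Description
0    buying lunch
1  airport travel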
CodePudding user response:
So I think you should look into the pandas str.split method.
I'm not too sure exactly how to implement this, but in reference to this post, something along the lines of this might work:
words = df1['Description'].str.split().apply(pd.Series).stack().reset_index(drop=True).tolist()
result = df2.loc[df2['Description'].str.split().explode().isin(words).groupby(level=0).any()]
Basically, you split the strings of df1 into single words and convert them into a list. Then you can check whether any word of a df2 string appears in that list with the .isin() method (applied here to the exploded words of df2).
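For completeness, a rough self-contained version of the same idea (the one-letter-word filter is an extra assumption, so that 'owning a car' is not kept just because of 'a'):
import pandas as pd

df1 = pd.DataFrame({'Description': ['i had lunch', 'going to the airport', 'buying a suitcase']})
df2 = pd.DataFrame({'Description': ['buying lunch', 'airport travel', 'owning a car']})

# words from df1, skipping one-letter words like 'i' and 'a'
words = [w for w in df1['Description'].str.split().explode().unique() if len(w) > 1]
# a df2 row matches if any of its words is in that list
mask = df2['Description'].str.split().explode().isin(words).groupby(level=0).any()
result = df2.loc[mask]
print(result)  # keeps 'buying lunch' and 'airport travel'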
hope this helps :)