I have 2 pandas data frames:
df1 = pd.DataFrame({'keyword': ['Sox','Sox','Jays','D', 'Jays'],
'val':[1,2,3,4,5]})
df2 = pd.DataFrame({'name': ['a b c', 'Sox Red', 'Blue Jays White Sox'],
'city':[f'city-{i}' for i in [1,2,3]],
'info': [5, 6, 7]})
>>> df1
keyword val
0 Sox 1
1 Sox 2
2 Jays 3
3 D 4
4 Jays 5
>>> df2
name city info
0 a b c city-1 5
1 Sox Red city-2 6
2 Blue Jays White Sox city-3 7
For each row of df1
the merge should be taking the exact element of df1['keyword']
and see if it is present in each of the df2['name']
elements (e.g. using .str.contains
). Now there can be the following options:
- if it is present in exactly 1 row of
df2['name']
: match the current row ofdf1
with this 1 row ofdf2
. - otherwise (if it is present in more than 1 or in 0 rows of
df2['name']
): don't match the current row ofdf1
with anything - the values will beNaN
.
The result should look like this:
keyword val name city info
0 Sox 1 NaN NaN NaN
1 Sox 2 NaN NaN NaN
2 Jays 3 Blue Jays city-3 7.0
3 D 4 NaN NaN NaN
4 Jays 5 Blue Jays city-3 7.0
Here in the column "keyword"
:
"Sox"
matches multiples lines ofdf2
(lines 1 and 2), so its merged withNaN
s,"D"
matches 0 lines ofdf2
, so it's also merged withNaN
s,"Jays"
matches exactly 1 line indf2
(line 2), so it's merged with this line.
How to do this using pandas?
CodePudding user response:
Use a regex and str.extractall
to extract the keywords, remove the duplicates with drop_duplicates
, and finally merge
:
import re
pattern = '|'.join(map(re.escape, df1['keyword']))
# 'Sox|Sox|Jays|D|Jays
key = (df2['name'].str.extractall(fr'\b({pattern})\b')[0]
.droplevel('match')
.drop_duplicates(keep=False)
)
out = df1.merge(df2.assign(keyword=key),
on='keyword', how='left')
print(out)
NB. I'm assuming you want to match full words only, if not remove the word boundaries (\b
).
Output:
keyword val name city info
0 Sox 1 NaN NaN NaN
1 Sox 2 NaN NaN NaN
2 Jays 3 Blue Jays White Sox city-3 7.0
3 D 4 NaN NaN NaN
4 Jays 5 Blue Jays White Sox city-3 7.0
CodePudding user response:
One way to do this is to use a combination of .apply()
and .str.contains()
to find the rows in df2
that match the rows in df1
. Then, we can use .merge()
to merge the resulting data frames:
def merge_dfs(row):
keyword = row['keyword']
df2_match = df2[df2['name'].str.contains(keyword)]
return df2_match.iloc[0] if len(df2_match) == 1 else pd.Series(dtype='float64')
result = df1.apply(merge_dfs, axis=1).reset_index(drop=True)
result = df1.merge(result, left_index=True, right_index=True, how='left')
This should give the desired result:
>>> result
keyword val city info name
0 Sox 1 NaN NaN NaN
1 Sox 2 NaN NaN NaN
2 Jays 3 city-3 7.0 Blue Jays White Sox
3 D 4 NaN NaN NaN
4 Jays 5 city-3 7.0 Blue Jays White Sox