I have two dataframes (~10k row each) and I want to find the best match for each entry in the other.
Specifically, here is an example:
For each entry in df1 col user
, I want to find the best fuzzy match in the corresponding col in df2. Then I want to include the location of the matching entry and the matched entry in the final dataframe.
import numpy as np
np.random.seed(123)
df1 = pd.DataFrame({'user': ["aparna", "pankaj", "sudhir", "Geeku"],
'location': np.random.choice( [5,7,3], 4)})
df2 = pd.DataFrame({'user': ["aparn", "arup", "Pankaj", "sudhir c", "Geek", "abc"],
'location': np.random.choice( [5,7,3], 6)})
Each dataframe look like this:
user location
0 aparn 5
1 arup 3
2 Pankaj 3
3 sudhir c 7
4 Geek 3
5 abc 7
And the final result look like this
matching_user location1 matched_user location
0 aparn 5 aparna 7
1 pankaj 3 Pankaj 5
2 sudhir 7 sudhir c 7
...
CodePudding user response:
You can use thefuzz.process.extractOne
:
# pip install thefuzz
from thefuzz import process
df1.merge(df2, left_on=df1['user'].apply(lambda x: process.extractOne(x, df2['user'])[0]),
right_on='user',
suffixes=('_1', '_2')
).drop(columns='user')
output:
user_1 location_1 user_2 location_2
0 aparna 3 aparn 5
1 pankaj 7 Pankaj 3
2 sudhir 3 sudhir c 7
3 Geeku 3 Geek 3