python dataframe fuzzy match and verification strategies-CodePudding

I have two dataframes (~10k row each) and I want to find the best match for each entry in the other.

Specifically, here is an example: For each entry in df1 col user, I want to find the best fuzzy match in the corresponding col in df2. Then I want to include the location of the matching entry and the matched entry in the final dataframe.

import numpy as np
np.random.seed(123)
df1 = pd.DataFrame({'user': ["aparna", "pankaj", "sudhir", "Geeku"], 
                   'location': np.random.choice( [5,7,3], 4)})
df2 = pd.DataFrame({'user': ["aparn", "arup", "Pankaj", "sudhir c", "Geek", "abc"], 
                   'location': np.random.choice( [5,7,3], 6)})

Each dataframe look like this:

      user  location
0     aparn         5
1      arup         3
2    Pankaj         3
3  sudhir c         7
4      Geek         3
5       abc         7

And the final result look like this

      matching_user  location1 matched_user location
0     aparn         5     aparna           7
1    pankaj         3     Pankaj           5
2    sudhir         7     sudhir c         7
...

CodePudding user response：

You can use thefuzz.process.extractOne:

# pip install thefuzz
from thefuzz import process

df1.merge(df2, left_on=df1['user'].apply(lambda x: process.extractOne(x, df2['user'])[0]),
          right_on='user',
          suffixes=('_1', '_2')
         ).drop(columns='user')

output:

   user_1  location_1    user_2  location_2
0  aparna           3     aparn           5
1  pankaj           7    Pankaj           3
2  sudhir           3  sudhir c           7
3   Geeku           3      Geek           3