Fuzzywuzzy - copy info associated with a row from one df to another using a match-CodePudding

My goal is to match address information based on 2 separate data frames. One data frame contains a unique value and the other dataframe does not. Id like to take the unique key from df1 and copy it to df2 based on how similar the fuzzy match is.

Here's an example:

df1 = 

index  address_df1           unique key     calls       Name            Sales

1    123 nice road           Uniquekey1      11       jim bob             8
2    150  spring drive       Uniquekey2      151      jane doe            8213
3    240 happy lane          Uniquekey3      71       michael scott       909
4    80 sad parkway          Uniquekey4      1586     tracey jackson      109
5    122 big lane            Uniquekey5      161      lulu buzz           99
6    315 small pk            Uniquekey6      586      tinker bell         11
7    13  round rd ste 10     Uniquekey7      8601     jack ryan          681
8    97  square rd           Uniquekey8      66       peter paul         61968

df2 (*note address column in different place) =

index   cost center        country         address_df2  

1         1111              us              123 nice rd 
2         1111              us              97  square rd 
3         1112              us              13  round rd
4         1112              us              150  spring dr

I'd like the final data frame to look like this:

RESULT 
df3 (with unique key) =

index   cost center(df2)      country(df2)      address_df2         **unique key(from df1)**   fuzzy match %

1         1111                  us              123 nice rd            Uniquekey1               90%
2         1111                  us              97  square rd          Uniquekey8               90%
3         1112                  us              13  round rd           Uniquekey7               90%
4         1112                  us              150  spring dr         Uniquekey2               90%

I've tried:

from fuzzywuzzy import process

THRESHOLD = 90

best_match = \
    df2['address_df2'].apply(lambda x: process.extractOne(x, df1['address_df1'],
                                                      score_cutoff=THRESHOLD))

Im able to find the matches using this code and its awesome! However, when I go to merge the two dataframes I cant get the address to match up. I think the data coming in from df1 isnt sorting or matching on the address together.

I've tried this code (below) to match the 2 dfs but again, the address's dont align. So what ends up happening is the unique id are incorrect.


df3 = pd.merge(df2, df1.set_index(best_match.apply(pd.Series)[2]),
               left_index=True, right_index=True, how='left')

CodePudding user response：

Try this:

df3 = pd.concat([df2, best_match.apply(pd.Series).drop(2, axis=1)], axis=1).rename({0:'unique key', 1:'fuzzy match %'}, axis=1)

Output:

>>> df3
   index  cost-center country    address_df2          unique key  fuzzy match %
0      1         1111      us    123-nice-rd       123-nice-road             92
1      2         1111      us   97-square-rd        97-square-rd            100
2      3         1112      us    13-round-rd  13-round-rd-ste-10             90
3      4         1112      us  150-spring-dr    150-spring-drive             90

CodePudding user response：

The third items of the tuple returned by extractOne is the index label of the best match row of df1. So you can use loc to select the unique key column from df1.

# Prefer use thefuzz package
from thefuzz import process

THRESHOLD = 90

best_match = lambda x: process.extractOne(x, df1['address_df1'],
                                          score_cutoff=THRESHOLD)[2]
#                       Index label of best match row from df1 ---^

df2['unique key'] = df1['unique key'].loc[df2['address_df2'].apply(best_match)] \
                                     .values

Output:

>>> df2
   cost center country    address_df2  unique key
1         1111      us    123 nice rd  Uniquekey1
2         1111      us   97 square rd  Uniquekey8
3         1112      us    13 round rd  Uniquekey7
4         1112      us  150 spring dr  Uniquekey2