My goal is to match address information based on 2 separate data frames. One data frame contains a unique value and the other dataframe does not. Id like to take the unique key from df1 and copy it to df2 based on how similar the fuzzy match is.
Here's an example:
df1 =
index address_df1 unique key calls Name Sales
1 123 nice road Uniquekey1 11 jim bob 8
2 150 spring drive Uniquekey2 151 jane doe 8213
3 240 happy lane Uniquekey3 71 michael scott 909
4 80 sad parkway Uniquekey4 1586 tracey jackson 109
5 122 big lane Uniquekey5 161 lulu buzz 99
6 315 small pk Uniquekey6 586 tinker bell 11
7 13 round rd ste 10 Uniquekey7 8601 jack ryan 681
8 97 square rd Uniquekey8 66 peter paul 61968
df2 (*note address column in different place) =
index cost center country address_df2
1 1111 us 123 nice rd
2 1111 us 97 square rd
3 1112 us 13 round rd
4 1112 us 150 spring dr
I'd like the final data frame to look like this:
RESULT
df3 (with unique key) =
index cost center(df2) country(df2) address_df2 **unique key(from df1)** fuzzy match %
1 1111 us 123 nice rd Uniquekey1 90%
2 1111 us 97 square rd Uniquekey8 90%
3 1112 us 13 round rd Uniquekey7 90%
4 1112 us 150 spring dr Uniquekey2 90%
I've tried:
from fuzzywuzzy import process
THRESHOLD = 90
best_match = \
df2['address_df2'].apply(lambda x: process.extractOne(x, df1['address_df1'],
score_cutoff=THRESHOLD))
Im able to find the matches using this code and its awesome! However, when I go to merge the two dataframes I cant get the address to match up. I think the data coming in from df1 isnt sorting or matching on the address together.
I've tried this code (below) to match the 2 dfs but again, the address's dont align. So what ends up happening is the unique id are incorrect.
df3 = pd.merge(df2, df1.set_index(best_match.apply(pd.Series)[2]),
left_index=True, right_index=True, how='left')
CodePudding user response:
Try this:
df3 = pd.concat([df2, best_match.apply(pd.Series).drop(2, axis=1)], axis=1).rename({0:'unique key', 1:'fuzzy match %'}, axis=1)
Output:
>>> df3
index cost-center country address_df2 unique key fuzzy match %
0 1 1111 us 123-nice-rd 123-nice-road 92
1 2 1111 us 97-square-rd 97-square-rd 100
2 3 1112 us 13-round-rd 13-round-rd-ste-10 90
3 4 1112 us 150-spring-dr 150-spring-drive 90
CodePudding user response:
The third items of the tuple returned by extractOne
is the index label of the best match row of df1
. So you can use loc
to select the unique key
column from df1
.
# Prefer use thefuzz package
from thefuzz import process
THRESHOLD = 90
best_match = lambda x: process.extractOne(x, df1['address_df1'],
score_cutoff=THRESHOLD)[2]
# Index label of best match row from df1 ---^
df2['unique key'] = df1['unique key'].loc[df2['address_df2'].apply(best_match)] \
.values
Output:
>>> df2
cost center country address_df2 unique key
1 1111 us 123 nice rd Uniquekey1
2 1111 us 97 square rd Uniquekey8
3 1112 us 13 round rd Uniquekey7
4 1112 us 150 spring dr Uniquekey2