I have two dataframes and want to fuzzy-compare the address strings of one against the other by applying my function across both:
sample1 = pd.DataFrame(data1.sample(n=200, random_state=42))
sample2 = pd.DataFrame(data2.sample(n=200, random_state=13))
def get_ratio(row):
    sample1 = row['address']
    sample2 = row['address']
    return fuzz.token_set_ratio(sample1, sample2)
match = data[data.apply(get_ratio, axis=1) >= 78] #I want to apply get_ratio to both sample1 and sample2
no_matched = data[data.apply(get_ratio, axis=1) <= 77] #I want to apply get_ratio to both sample1 and sample2
Thanks in advance for your help!
CodePudding user response:
You need to build every pairing of an address from the first dataframe with an address from the second (the Cartesian product), then score each pair. You can find a similar question here.
For your case, first create the pairs:
combs = list(itertools.product(data1["address"], data2["address"]))
combs = pd.DataFrame(combs)
Then use the proper method for matching:
combs['score'] = combs.apply(lambda x: fuzz.token_set_ratio(x[0],x[1]), axis=1)
Now, based on the score, you can separate the pairs that matched from those that did not.
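As a rough sketch of that last step, the snippet below scores every pair and splits the result at the threshold from the question (78 and above counts as a match). The `score_pairs` helper, the `left`/`right` column names, and the sample addresses are made up for illustration. `simple_ratio` is only a standard-library stand-in so the snippet runs without a fuzzy-matching package; in practice you would pass `fuzz.token_set_ratio` as the scorer.

```python
import itertools
from difflib import SequenceMatcher

import pandas as pd


def simple_ratio(a, b):
    # Stand-in scorer (assumption): 0-100 similarity from the standard library.
    # Swap in fuzz.token_set_ratio for the real comparison.
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def score_pairs(left, right, scorer=simple_ratio):
    # Build every (left, right) pairing, then score each pair once.
    pairs = pd.DataFrame(
        list(itertools.product(left, right)), columns=["left", "right"]
    )
    pairs["score"] = [scorer(a, b) for a, b in zip(pairs["left"], pairs["right"])]
    return pairs


# Hypothetical sample data standing in for the two address columns.
addresses1 = pd.Series(["12 Main Street", "5 Oak Avenue"])
addresses2 = pd.Series(["12 main street", "99 Pine Road"])

combs = score_pairs(addresses1, addresses2)
match = combs[combs["score"] >= 78]    # pairs at or above the threshold
no_match = combs[combs["score"] < 78]  # everything else
```

Splitting on `>= 78` and `< 78` keeps the two groups exhaustive, so every pair lands in exactly one of them.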
I advise you to group and clean the addresses first (e.g., by lower-casing them and removing duplicates). Otherwise it might take a very long time to compute, because the number of pairs is the product of the two dataframe sizes.
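One possible way to do that cleaning, sketched with the standard library only. The helper names `clean_address` and `unique_cleaned` and the sample list are hypothetical, and the normalisation rules (lower-case, trim, collapse whitespace) are an assumption about what "clean" should mean for your data:

```python
import re


def clean_address(text):
    # Lower-case, trim, and collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text.strip().lower())


def unique_cleaned(addresses):
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(clean_address(a) for a in addresses))


# Hypothetical sample data with two spellings of the same address.
raw = ["12 Main  Street", "12 main street ", "5 Oak Avenue"]
cleaned = unique_cleaned(raw)
```

Running this on each address column before building the pairs shrinks both inputs, which reduces the number of comparisons multiplicatively.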