Apply fuzzy ratio to two dataframes

I have two dataframes, and I want to fuzzy-compare the address strings in one against those in the other by applying my function across both:

sample1 = pd.DataFrame(data1.sample(n=200, random_state=42))
sample2 = pd.DataFrame(data2.sample(n=200, random_state=13))

def get_ratio(row):
    sample1 = row['address']
    sample2 = row['address']
    return fuzz.token_set_ratio(sample1, sample2)
match = data[data.apply(get_ratio, axis=1) >= 78] #I want to apply get_ratio to both sample1 and sample2
no_matched = data[data.apply(get_ratio, axis=1) <= 77] #I want to apply get_ratio to both sample1 and sample2

Thanks in advance for your help!

CodePudding user response：

You need to pair every address from the first dataframe with every address from the second (strictly speaking this is the Cartesian product, not permutations), then score each pair. You can find a similar question here.

In your case, first build the pairs:

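import itertools

# Every (data1 address, data2 address) pair; the resulting columns are labelled 0 and 1
combs = list(itertools.product(data1["address"], data2["address"]))
combs = pd.DataFrame(combs)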

Then score each pair with the matching method you want:

combs['score'] = combs.apply(lambda x: fuzz.token_set_ratio(x[0],x[1]), axis=1)

Now, based on the score, you can separate the pairs that matched from those that did not (for example, with your threshold of 78).
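The steps above can be put together roughly as follows. This is only a sketch: the addresses are made up, the column names address_1 and address_2 and the helper split_pairs are hypothetical, and stand_in_ratio is a standard-library placeholder used so the snippet runs without extra packages. It does not give the same scores as fuzz.token_set_ratio, which you would pass in as the scorer instead.

```python
import itertools
from difflib import SequenceMatcher

import pandas as pd


def stand_in_ratio(a, b):
    """Stdlib placeholder scorer (0-100); NOT equivalent to fuzz.token_set_ratio."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def split_pairs(left, right, scorer, threshold=78):
    """Score every (left, right) address pair and split the pairs on the threshold."""
    pairs = pd.DataFrame(
        list(itertools.product(left, right)),
        columns=["address_1", "address_2"],  # hypothetical column names
    )
    pairs["score"] = [
        scorer(a, b) for a, b in zip(pairs["address_1"], pairs["address_2"])
    ]
    matched = pairs[pairs["score"] >= threshold]
    not_matched = pairs[pairs["score"] < threshold]
    return matched, not_matched


# Made-up addresses, for illustration only.
left = pd.Series(["12 Main Street", "5 Oak Avenue"])
right = pd.Series(["12 main street", "99 Pine Road"])

# With the real library you would pass scorer=fuzz.token_set_ratio here.
matched, not_matched = split_pairs(left, right, stand_in_ratio)
```

With your data the call would look like split_pairs(sample1["address"], sample2["address"], fuzz.token_set_ratio). Each row of the result is a pair of addresses, so one address can show up in both the matched and the not-matched frame.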

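I advise you to group and clean the addresses first (e.g., lower-casing them and removing duplicates). Otherwise it may take a very long time to compute, because every address in one dataframe is compared with every address in the other, so the number of comparisons is the product of the two lengths.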
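A minimal sketch of that cleaning step, assuming the addresses are plain strings in a pandas Series. The helper name clean_addresses and the sample values are made up for illustration, and collapsing repeated whitespace is an extra normalisation beyond what is described above.

```python
import pandas as pd


def clean_addresses(addresses):
    """Lower-case, trim, collapse whitespace, and drop duplicate addresses."""
    cleaned = (
        addresses.dropna()          # missing addresses cannot be compared
        .astype(str)
        .str.lower()
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)  # assumption: extra spaces are noise
    )
    return cleaned.drop_duplicates().reset_index(drop=True)


# Made-up values, for illustration only.
raw = pd.Series(["12 Main Street", "12  main street ", "5 Oak Avenue", None])
unique = clean_addresses(raw)
```

Running this on both address columns before building the pairs shrinks each side, and the saving multiplies because the pair count is the product of the two lengths.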