I have two data frames: the first has 200k records and the second has 9k. I need to fuzzy-match strings between two columns. I dropped the duplicate values in both data frames, but similar strings may still remain, so I wrote the code below. My plan was to manually go through the two best matches stored in the third column and judge whether each one is reasonable. The problem is that the code has been running for the last two days and I do not know how to speed it up. I would appreciate any suggestions for improving its speed and performance.
Thanks. The mock data:
| index | comp | var |
|---|---|---|
| 0 | google inc-radio automation bu | apple |
| 1 | apple inc | |
| 2 | google inc | microsoft |
| 3 | applebee | google inc |
| 4 | microsoft | appplebee |
The code I wrote:
```python
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd

# empty list for storing the matches later
mat1 = []

list2 = df['comp'].tolist()
list1 = df['var'].tolist()

# taking the threshold as 90
threshold = 90

# iterating through list1 to extract its two closest matches from list2
for i in list1:
    mat1.append(process.extract(i, list2, limit=2))

df['matches'] = mat1
```
CodePudding user response:
There is a much faster package, rapidfuzz, by @maxbachmann:

```
pip install rapidfuzz
```

Sample usage:

```python
from rapidfuzz import process, utils

# default_process lowercases, trims whitespace, and strips
# non-alphanumeric characters before scoring
df['matches'] = df['var'].apply(
    lambda x: process.extract(x, df['comp'], limit=2,
                              processor=utils.default_process)
)
```
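
If a per-row `apply` is still too slow at 200k x 9k comparisons, rapidfuzz also has `process.cdist`, which computes the whole score matrix in compiled code and can spread the work across CPU cores. Below is a minimal sketch of that approach; it reuses the single mock frame `df` from the question (with two real frames you would pass one frame's column as the queries and the other's as the choices), uses `fuzz.WRatio` because that is the default scorer in fuzzywuzzy's `process.extract`, and assumes the score matrix fits in memory:

```python
import numpy as np
from rapidfuzz import process, fuzz, utils

# One call scores every 'var' string against every 'comp' string;
# workers=-1 uses all CPU cores. Scores are 0-100, so dtype=np.uint8
# keeps a 200k x 9k matrix at ~1.8 GB instead of ~7.2 GB as float32.
scores = process.cdist(
    df['var'].fillna('').tolist(),
    df['comp'].tolist(),
    scorer=fuzz.WRatio,
    processor=utils.default_process,
    workers=-1,
    dtype=np.uint8,
)

# For each 'var' row, take the two highest-scoring 'comp' candidates
# (argsort is ascending, so read the last two indices in reverse).
top2 = np.argsort(scores, axis=1)[:, :-3:-1]
df['matches'] = [
    [(df['comp'].iloc[j], int(scores[i, j])) for j in row]
    for i, row in enumerate(top2)
]
```

You can also pass `score_cutoff=90` to `cdist` so that scores below your threshold are stored as 0, which makes weak matches cheap to discard afterwards. If the full matrix does not fit in memory, run `cdist` over chunks of the query list and collect the top-2 per chunk.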