I have two data frames: the first has 200k records and the second has 9k. I need to fuzzy-match strings between two columns. I dropped the duplicate values in both data frames, but similar strings may still remain, so I wrote the code below. My plan was to manually go through the two best matches stored in the third column and judge whether each one is reasonable. The problem is that the code has been running for the last two days and I do not know how to speed it up. I would appreciate any suggestions for improving its speed and performance.
Thanks. The mock data:
| index | comp | var |
|---|---|---|
| 0 | google inc-radio automation bu | apple |
| 1 | apple inc | |
| 2 | google inc | microsoft |
| 3 | applebee | google inc |
| 4 | microsoft | appplebee |
The code I wrote:
```python
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd

# empty list for storing the matches later
mat1 = []

list2 = df['comp'].tolist()
list1 = df['var'].tolist()

# taking the threshold as 90
threshold = 90

# iterating through list1 to extract its two closest matches from list2
for i in list1:
    mat1.append(process.extract(i, list2, limit=2))

df['matches'] = mat1
```
CodePudding user response:
There is a much faster package, rapidfuzz, by @maxbachmann:

```
pip install rapidfuzz
```

Sample usage:

```python
from rapidfuzz import process, utils

# default_process lowercases, trims whitespace, and strips
# non-alphanumeric characters before scoring
df['matches'] = df['var'].apply(
    lambda x: process.extract(x, df['comp'], limit=2,
                              processor=utils.default_process)
)
```
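
If a per-row `apply` is still too slow at 200k x 9k comparisons, rapidfuzz also has `process.cdist`, which computes the whole score matrix in compiled code and can spread the work across CPU cores. Below is a minimal sketch of that approach; it reuses the single mock frame `df` from the question (with two real frames you would pass one frame's column as the queries and the other's as the choices), uses `fuzz.WRatio` because that is the default scorer in fuzzywuzzy's `process.extract`, and assumes the score matrix fits in memory:

```python
import numpy as np
from rapidfuzz import process, fuzz, utils

# One call scores every 'var' string against every 'comp' string;
# workers=-1 uses all CPU cores. Scores are 0-100, so dtype=np.uint8
# keeps a 200k x 9k matrix at ~1.8 GB instead of ~7.2 GB as float32.
scores = process.cdist(
    df['var'].fillna('').tolist(),
    df['comp'].tolist(),
    scorer=fuzz.WRatio,
    processor=utils.default_process,
    workers=-1,
    dtype=np.uint8,
)

# For each 'var' row, take the two highest-scoring 'comp' candidates
# (argsort is ascending, so read the last two indices in reverse).
top2 = np.argsort(scores, axis=1)[:, :-3:-1]
df['matches'] = [
    [(df['comp'].iloc[j], int(scores[i, j])) for j in row]
    for i, row in enumerate(top2)
]
```

You can also pass `score_cutoff=90` to `cdist` so that scores below your threshold are stored as 0, which makes weak matches cheap to discard afterwards. If the full matrix does not fit in memory, run `cdist` over chunks of the query list and collect the top-2 per chunk.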