I'm trying to make a fuzzy match for the values in list to_search
. Search each value in to_search
within
choices
list and show the corresponding item from result
list. Like a MS Excel VLookUp, but with fuzzy search.
This is my current code that almost print the correct output for different values, but for the values of to_search
that
don't have any similarity within choices
, I´d like to show in output Not Found
, but currently I'm getting some other output.
The search I´m looking for is by prefix, this is in the same order. For example the value 38050
appears with this match ('358', 72.0, 8)
because 3, 5 and 8 are present in 38050, but for me is not of interest since 3, 5 and 8 are in different order. Would be a match for me if the choice found would be 380XX, that at least has similarity in prefix compared with 38050. I hope you understand my explanation.
from rapidfuzz import process, fuzz
choices = [
'237','1721','124622','334','124624','124','1246','1876','358',
'33751','33679','599','61','230','31','65','1721','1','124623'
]
result = [
'NAD','ATE','STA','SSI','GYP','RIC','EEC','AND','GIU','ANC',
'PAI','GAR','TAL','ANI','LAN','TRI','GDO','MAR','EDE'
]
to_search = ['18763044','187635','23092','3162','38050','33','49185','51078','1246','1721']
for element in to_search:
match = process.extractOne(element, choices, scorer=fuzz.WRatio)
print(element,result[match[2]],' ## ',match)
Current output
>>>
18763044 AND ## ('1876', 90.0, 7)
187635 AND ## ('1876', 90.0, 7)
23092 ANI ## ('230', 90.0, 13)
3162 LAN ## ('31', 90.0, 14)
38050 GIU ## ('358', 72.0, 8) // This should be marked as NOT FOUND
33 SSI ## ('334', 90.0, 3)
49185 MAR ## ('1', 90.0, 17) // This should be marked as NOT FOUND
51078 MAR ## ('1', 90.0, 17) // This should be marked as NOT FOUND
1246 EEC ## ('1246', 100.0, 6)
1721 ATE ## ('1721', 100.0, 1)
The output I'm trying to get:
18763044 AND
187635 AND
23092 ANI
3162 LAN
38050 NOT FOUND
33 SSI
49185 NOT FOUND
51078 NOT FOUND
1246 EEC
1721 ATE
In table format for easy understanding of inputs and output. Thanks in advance
CodePudding user response:
You can choose a scorer that will take order more into account than WRatio
. Then set score_cutoff
to exclude results below a given similarity.
For your example, fuzz.ratio
with score_cutoff=60
seems to work. You'll have to test on bigger datasets and try different scorers to know what you need exactly:
from rapidfuzz import process, fuzz
choices = [
'237','1721','124622','334','124624','124','1246','1876','358',
'33751','33679','599','61','230','31','65','1721','1','124623'
]
result = [
'NAD','ATE','STA','SSI','GYP','RIC','EEC','AND','GIU','ANC',
'PAI','GAR','TAL','ANI','LAN','TRI','GDO','MAR','EDE'
]
to_search = ['18763044','187635','23092','3162','38050','33','49185','51078','1246','1721']
for element in to_search:
match = process.extractOne(element, choices, scorer=fuzz.ratio, score_cutoff=60)
if match:
print(element,result[match[2]],' ## ',match)
else:
print(element,"NOT FOUND")
Output:
18763044 AND ## ('1876', 66.66666666666667, 7)
187635 AND ## ('1876', 80.0, 7)
23092 ANI ## ('230', 75.0, 13)
3162 LAN ## ('31', 66.66666666666667, 14)
38050 NOT FOUND
33 SSI ## ('334', 80.0, 3)
49185 NOT FOUND
51078 NOT FOUND
1246 EEC ## ('1246', 100.0, 6)
1721 ATE ## ('1721', 100.0, 1)
CodePudding user response:
You can create you own scorer:
def my_scorer(query,choice,**kwargs):
# default score
score=fuzz.WRatio(query,choice)
is_prefix=False
# if choice is a prefix of query
if choice in query and query.index(choice)==0:
is_prefix=True
# if query is a prefix of choice
if query in choice and choice.index(query)==0:
is_prefix=True
if(not is_prefix):
score=-1
return score
And pass it to process.extractOne
along with score_cutoff=0
to ignore results with score lower than 0
:
match = process.extractOne(element, choices, scorer=my_scorer, score_cutoff=0)
if(match):
print(element,result[match[2]])
else:
print(element,'NOT FOUND')