How to do a fuzzy search between lists, showing matches and not-found elements?

I'm trying to fuzzy-match each value in the list to_search against the list choices and show the corresponding item from the list result. It is like an MS Excel VLOOKUP, but with fuzzy search.

My current code prints almost the correct output, but for values of to_search that have no similarity with anything in choices I'd like the output to show NOT FOUND. Currently I get some other result instead.

The match I'm looking for is by prefix, that is, with the digits in the same order. For example, the value 38050 gets the match ('358', 72.0, 8) because 3, 5 and 8 are all present in 38050, but that is of no interest to me since 3, 5 and 8 appear there in a different order. It would be a match for me if the choice found were something like 380XX, which at least shares a prefix with 38050. I hope my explanation is clear.

from rapidfuzz import process, fuzz

choices = [
        '237','1721','124622','334','124624','124','1246','1876','358',
        '33751','33679','599','61','230','31','65','1721','1','124623'
    ]

result = [
            'NAD','ATE','STA','SSI','GYP','RIC','EEC','AND','GIU','ANC',
            'PAI','GAR','TAL','ANI','LAN','TRI','GDO','MAR','EDE'
        ]

to_search = ['18763044','187635','23092','3162','38050','33','49185','51078','1246','1721']

for element in to_search:
    match = process.extractOne(element, choices, scorer=fuzz.WRatio)
    print(element, result[match[2]], '         ## ', match)

Current output

>>>
18763044    AND         ##  ('1876', 90.0, 7)
187635      AND         ##  ('1876', 90.0, 7)
23092       ANI         ##  ('230', 90.0, 13)
3162        LAN         ##  ('31', 90.0, 14)
38050       GIU         ##  ('358', 72.0, 8) // This should be marked as NOT FOUND
33          SSI         ##  ('334', 90.0, 3)
49185       MAR         ##  ('1', 90.0, 17)  // This should be marked as NOT FOUND
51078       MAR         ##  ('1', 90.0, 17)  // This should be marked as NOT FOUND
1246        EEC         ##  ('1246', 100.0, 6)
1721        ATE         ##  ('1721', 100.0, 1)

The output I'm trying to get:

18763044    AND
187635      AND
23092       ANI
3162        LAN
38050       NOT FOUND
33          SSI
49185       NOT FOUND
51078       NOT FOUND
1246        EEC
1721        ATE

Thanks in advance.

CodePudding user response：

You can choose a scorer that takes character order into account more than WRatio does. Then set score_cutoff to exclude results below a given similarity; when no choice reaches the cutoff, extractOne returns None.

For your example, fuzz.ratio with score_cutoff=60 seems to work. You'll have to test on bigger datasets and try different scorers to know what you need exactly:

from rapidfuzz import process, fuzz

choices = [
        '237','1721','124622','334','124624','124','1246','1876','358',
        '33751','33679','599','61','230','31','65','1721','1','124623'
    ]

result = [
            'NAD','ATE','STA','SSI','GYP','RIC','EEC','AND','GIU','ANC',
            'PAI','GAR','TAL','ANI','LAN','TRI','GDO','MAR','EDE'
        ]

to_search = ['18763044','187635','23092','3162','38050','33','49185','51078','1246','1721']

for element in to_search:
    match = process.extractOne(element, choices, scorer=fuzz.ratio, score_cutoff=60)
    if match:
        print(element, result[match[2]], '         ## ', match)
    else:
        print(element, "NOT FOUND")

Output:

18763044 AND          ##  ('1876', 66.66666666666667, 7)
187635 AND          ##  ('1876', 80.0, 7)
23092 ANI          ##  ('230', 75.0, 13)
3162 LAN          ##  ('31', 66.66666666666667, 14)
38050 NOT FOUND
33 SSI          ##  ('334', 80.0, 3)
49185 NOT FOUND
51078 NOT FOUND
1246 EEC          ##  ('1246', 100.0, 6)
1721 ATE          ##  ('1721', 100.0, 1)
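
To see where these scores come from: as far as I understand it, fuzz.ratio is a normalized insertion/deletion similarity, which works out to 200 * LCS / (len(a) + len(b)), where LCS is the length of the longest common subsequence. The sketch below is a pure-Python illustration of that formula, not rapidfuzz's actual code; the names lcs_length and indel_ratio are made up for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # extend the common subsequence ending before both characters
                curr.append(prev[j - 1] + 1)
            else:
                # otherwise carry over the best result seen so far
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Similarity in [0, 100]; assumed to mirror what fuzz.ratio reports."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 200.0 * lcs_length(a, b) / total


for query, choice in [('187635', '1876'), ('23092', '230'), ('38050', '358')]:
    print(query, choice, indel_ratio(query, choice))
```

Under this formula, 38050 against 358 shares only two digits in order (for example 3 and 8), giving 200 * 2 / 8 = 50, which is below the cutoff of 60 and is why it ends up as NOT FOUND.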

CodePudding user response：

You can create your own scorer:

from rapidfuzz import fuzz

def my_scorer(query, choice, **kwargs):
    # a choice only counts if one string is a prefix of the other
    is_prefix = query.startswith(choice) or choice.startswith(query)

    # keep the default WRatio score for prefix matches; otherwise return -1
    # so that score_cutoff=0 filters the choice out
    return fuzz.WRatio(query, choice) if is_prefix else -1

And pass it to process.extractOne along with score_cutoff=0 to ignore results with score lower than 0:

match = process.extractOne(element, choices, scorer=my_scorer, score_cutoff=0)
if match:
    print(element, result[match[2]])
else:
    print(element, 'NOT FOUND')
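
If the prefix rule is really all you need, the lookup can also be sketched without any fuzzy scorer. The version below is only an illustration under a few assumptions: a "match" means one string is a full prefix of the other, the longest overlap wins, and ties go to the choice closest in length and then to the earliest one in the list. The function name prefix_lookup is hypothetical.

```python
choices = [
    '237', '1721', '124622', '334', '124624', '124', '1246', '1876', '358',
    '33751', '33679', '599', '61', '230', '31', '65', '1721', '1', '124623'
]

result = [
    'NAD', 'ATE', 'STA', 'SSI', 'GYP', 'RIC', 'EEC', 'AND', 'GIU', 'ANC',
    'PAI', 'GAR', 'TAL', 'ANI', 'LAN', 'TRI', 'GDO', 'MAR', 'EDE'
]

to_search = ['18763044', '187635', '23092', '3162', '38050', '33',
             '49185', '51078', '1246', '1721']


def prefix_lookup(query, choices, result, not_found='NOT FOUND'):
    """Return the result paired with the best prefix-related choice."""
    best_key, best_index = None, None
    for i, choice in enumerate(choices):
        # only consider choices where one string is a prefix of the other
        if query.startswith(choice) or choice.startswith(query):
            # rank by overlap length, then by smallest length difference
            key = (min(len(query), len(choice)), -abs(len(query) - len(choice)))
            # strict comparison keeps the earliest choice on exact ties
            if best_key is None or key > best_key:
                best_key, best_index = key, i
    return not_found if best_index is None else result[best_index]


for element in to_search:
    print(element, prefix_lookup(element, choices, result))
```

On the sample data this reproduces the output asked for in the question. It has no tolerance for typos, though, so it only fits if the values are expected to agree exactly on their leading digits.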