Python match user input to element in list-CodePudding

I am trying to match user input to an element in the list.
The goal is to allow the user to not type the whole name for the element because there are elements longer than 30 characters:

the result must contain all the characters in the input

Ex: user input foobar matches with
- foobarxx
but not
- fobar
extra characters between inputted keywords are allowed

Ex: user input abc matches with:
- abc, a bc, axxbxxxc
the most relevant result is selected

Ex: apple pie matches with:
- apple tasty pie party, app legit piece, aXpXpXlXeX XpXiXe
However I only want the most relevant result, which is apple tasty pie party

Code

I have somehow achieved (1) and (2) using:

enter = input("input: ")
all_element = ["orange", "apple pie", "pine apple pie", "ppap", "pen pineapple apple pen"]
pattern = ('(?:. )?'.join(list(enter))).replace(" ", r"\s")
print(pattern)
results = {}
for full_name in all_element:
    all = re.findall(pattern, full_name)
    if all:
        results[len(max(all))] = full_name
print(results)
print(f"result: {results[max(results)]}\n")

result:

input: pen apple
p(?:. )?e(?:. )?n(?:. )?\s(?:. )?a(?:. )?p(?:. )?p(?:. )?l(?:. )?e
{22: 'pen pineapple apple pen'}
result: pen pineapple apple pen

input: ora
o(?:. )?r(?:. )?a
{3: 'orange'}
result: orange

I am currently trying to solve (3)

Based on the example in (3), My plan is to look at how many breaks happens, I know that: - "apple tasty pie party" breaked 1 times, by the word " tasty" - "app legit pieece" breaked 2 times, one space and one "git" - aXpXpXlXeX XpXiXe breaked n(X) number of times

The result with the least breaked times is selected, which is apple tasty pie party

From the code above I am just using the length of the matched element to select the result, which is inaccurate, since ppap results in pen pineapple apple pen instead of ppap itself:

input: ppap
p(?:. )?p(?:. )?a(?:. )?p
{4: 'ppap', 21: 'pen pineapple apple pen'}
result: pen pineapple apple pen

So I am wondering how could I get the number of breaked times based on (?:. )?, where

result should be:

{0: 'ppap', 2: 'pen pineapple apple pen'}

with the key is number of breaked times, and item as the selected element

such that I can simply use a min() to get the most relevant result

The question is how, do i need to write my own function or is there any regex pattern that can handle this

CodePudding user response：

You can use the Counter class from the collections module to quickly eliminate elements that do not match the first two criteria. Then you can use the SequenceMatcher from the difflib to select the most relevant choice among the remaining elements.

import difflib
from collections import Counter

enter = input("input: ")
all_elements = ["orange", "apple pie", "pine apple pie", "ppap", "pen pineapple apple pen"]
cnts = Counter(enter)
for k,v in cnts.items():
    start = len(all_elements) - 1
    while start >= 0:
        if k not in all_elements[start] or all_elements[start].count(k) < v:
            del all_elements[start]
        start -= 1

results = {}
for elem in all_elements:
    matcher = difflib.SequenceMatcher(a=enter, b=elem)
    num = matcher.quick_ratio()
    blocks = len(matcher.get_matching_blocks())
    results[elem] = (num, blocks)

min_blocks = min([i[1] for i in results.values()])
min_elems = {k:v[0] for k,v in results.items() if v[1] == min_blocks}
print(max(min_elems, key=lambda x: min_elems[x]))

CodePudding user response：

another possible way of comparing similar strings is using the Levenshtein module.

from Levenshtein import distance as lev

st = "apple pie"
l = ['apple tasty pie party', 'app legit piece', 'aXpXpXlXeX XpXiXe']


def find_best_match(some_list, st):

    return max((lev(st,s), s) for s in some_list)[1]


find_best_match(l, st)

apple tasty pie party