I am trying to match user input to an element in the list.
The goal is to allow the user to not type the whole name for the element because there are elements longer than 30 characters:
the result must contain all the characters in the input
Ex: user input
foobar
matches withfoobarxx
but not
fobar
extra characters between inputted keywords are allowed
Ex: user input
abc
matches with:abc
,a bc
,axxbxxxc
the most relevant result is selected
Ex:
apple pie
matches with:apple tasty pie party
,app legit piece
,aXpXpXlXeX XpXiXe
However I only want the most relevant result, which is
apple tasty pie party
Code
I have somehow achieved (1) and (2) using:
enter = input("input: ")
all_element = ["orange", "apple pie", "pine apple pie", "ppap", "pen pineapple apple pen"]
pattern = ('(?:. )?'.join(list(enter))).replace(" ", r"\s")
print(pattern)
results = {}
for full_name in all_element:
all = re.findall(pattern, full_name)
if all:
results[len(max(all))] = full_name
print(results)
print(f"result: {results[max(results)]}\n")
result:
input: pen apple
p(?:. )?e(?:. )?n(?:. )?\s(?:. )?a(?:. )?p(?:. )?p(?:. )?l(?:. )?e
{22: 'pen pineapple apple pen'}
result: pen pineapple apple pen
input: ora
o(?:. )?r(?:. )?a
{3: 'orange'}
result: orange
I am currently trying to solve (3)
Based on the example in (3), My plan is to look at how many breaks happens, I know that:
- "apple
tasty pie
party" breaked 1 times, by the word " tasty"
- "app
le
git pie
ece" breaked 2 times, one space and one "git"
- a
Xp
Xp
Xl
Xe
X
Xp
Xi
Xe
breaked n(X) number of times
The result with the least breaked times is selected, which is apple tasty pie party
From the code above I am just using the length of the matched element to select the result, which is inaccurate, since ppap
results in pen pineapple apple pen
instead of ppap
itself:
input: ppap
p(?:. )?p(?:. )?a(?:. )?p
{4: 'ppap', 21: 'pen pineapple apple pen'}
result: pen pineapple apple pen
So I am wondering how could I get the number of breaked times based on (?:. )?
, where
result
should be:
{0: 'ppap', 2: 'pen pineapple apple pen'}
with the key is number of breaked times, and item as the selected element
such that I can simply use a min()
to get the most relevant result
The question is how, do i need to write my own function or is there any regex pattern that can handle this
CodePudding user response:
You can use the Counter
class from the collections
module to quickly eliminate elements that do not match the first two criteria. Then you can use the SequenceMatcher
from the difflib
to select the most relevant choice among the remaining elements.
import difflib
from collections import Counter
enter = input("input: ")
all_elements = ["orange", "apple pie", "pine apple pie", "ppap", "pen pineapple apple pen"]
cnts = Counter(enter)
for k,v in cnts.items():
start = len(all_elements) - 1
while start >= 0:
if k not in all_elements[start] or all_elements[start].count(k) < v:
del all_elements[start]
start -= 1
results = {}
for elem in all_elements:
matcher = difflib.SequenceMatcher(a=enter, b=elem)
num = matcher.quick_ratio()
blocks = len(matcher.get_matching_blocks())
results[elem] = (num, blocks)
min_blocks = min([i[1] for i in results.values()])
min_elems = {k:v[0] for k,v in results.items() if v[1] == min_blocks}
print(max(min_elems, key=lambda x: min_elems[x]))
CodePudding user response:
another possible way of comparing similar strings is using the Levenshtein module.
from Levenshtein import distance as lev
st = "apple pie"
l = ['apple tasty pie party', 'app legit piece', 'aXpXpXlXeX XpXiXe']
def find_best_match(some_list, st):
return max((lev(st,s), s) for s in some_list)[1]
find_best_match(l, st)
apple tasty pie party