Python regex prefers longer fuzzy match to shorter exact match

Time:10-05

I am using regex in Python to search for multiple patterns in a string. A simplified example would be as follows:

import regex
s = "vrhvydhvkzejjvksdlstringvhehvehvurejlcslvdk"  #string to look into
p = ['(?P<string>string)', '(?P<longtext>longtext)']  #patterns to search for
r = regex.compile('(?b)(' + " | ".join(p) + '){s<=3}')  #regex, allowing for 3 mismatches, bestmatch to be reported
r.search(s)   #searching for patterns p in string s
<regex.Match object; span=(18, 25), match='stringv', fuzzy_counts=(1, 0, 0)>   #search results

My expected result would be:

<regex.Match object; span=(18, 24), match='string', fuzzy_counts=(0, 0, 0)>

Why does regex report the fuzzy match stringv with 1 mismatch instead of the exact match string? And how do I need to modify my code to get my expected result?

I am using Python 3.7.3 and regex 2.5.115.

CodePudding user response:

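The expression '(?b)(' + " | ".join(p) + '){s<=3}' produces the regex (?b)((?P<string>string) | (?P<longtext>longtext)){s<=3}. Note the spaces around |: they are literal characters, so the first alternative is really "string " (with a trailing space) and the second is " longtext" (with a leading space). Neither occurs exactly in s, so an exact match is impossible, and the best fuzzy match substitutes v for the trailing space, which gives stringv with one substitution.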
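The effect of the separator can be seen without running any regex at all. The sketch below builds the pattern string both ways and counts the differences between the spaced alternative and the reported match. The names spaced, tight and substitutions are illustrative only and do not come from the question.

```python
# Patterns from the question.
p = ['(?P<string>string)', '(?P<longtext>longtext)']

# Separator with spaces (as in the question) vs. without spaces.
spaced = '(?b)(' + " | ".join(p) + '){s<=3}'
tight = '(?b)(' + "|".join(p) + '){s<=3}'

print(spaced)  # (?b)((?P<string>string) | (?P<longtext>longtext)){s<=3}
print(tight)   # (?b)((?P<string>string)|(?P<longtext>longtext)){s<=3}


def substitutions(a, b):
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


# The first spaced alternative is literally "string " (trailing space),
# so matching it against "stringv" costs exactly one substitution.
print(substitutions("string ", "stringv"))  # 1
```

Printing the pattern before compiling it is a cheap way to spot stray literal characters like these.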

You need

r = regex.compile('(?b)(' + "|".join(p) + '){s<=3}')  #regex, allowing for 3 mismatches, bestmatch to be reported

See the Python demo:

import regex
s = "vrhvydhvkzejjvksdlstringvhehvehvurejlcslvdk"  #string to look into
p = ['(?P<string>string)', '(?P<longtext>longtext)']  #patterns to search for
rx = '(?b)(' + "|".join(p) + '){s<=3}'  #no spaces around |
r = regex.compile(rx)  #regex, allowing for 3 mismatches, bestmatch to be reported
print( r.search(s) )
# => <regex.Match object; span=(18, 24), match='string'>
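To confirm that a match is exact rather than fuzzy, the match object's fuzzy_counts attribute (substitutions, insertions, deletions) can be inspected, as in the output shown in the question. This is a minimal sketch, assuming the third-party regex package is installed. The variable names subs, ins, dels and matched are illustrative.

```python
import regex  # third-party package, not the standard-library re module

s = "vrhvydhvkzejjvksdlstringvhehvehvurejlcslvdk"
p = ['(?P<string>string)', '(?P<longtext>longtext)']

# Same pattern as in the answer: no spaces around the | separator.
r = regex.compile('(?b)(' + "|".join(p) + '){s<=3}')
m = r.search(s)

# fuzzy_counts is (substitutions, insertions, deletions).
subs, ins, dels = m.fuzzy_counts
print(m.group(), m.span(), subs, ins, dels)

# Named groups that did not take part in the match are None.
matched = [name for name, value in m.groupdict().items() if value is not None]
print(matched)
```

Filtering groupdict() this way is one option for finding out which of the named patterns produced the match.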