Home > other >  python regex BESTMATCH behaviour
python regex BESTMATCH behaviour

Time:12-22

I am seeing a behaviour of the python regex library for fuzzy matching that I do not quite understand.

The string I am searching into is:

s="TACGGTTTTAACTTTGCAAGCTTCAGAAGGGATTACTAGCAGTAAAAATGCGGAAATTTCTCTTTATGATGGCGCCACGCTCAATTTGGCTTCAAACAGCGTTAAATTAATGGGTAATGTCAAG"

and the patterns I am searching for are (5 mismatches allowed):

for match in regex.finditer("(CAAGCTTCAGAAGGGATCACTAGCGATAAA|GGCTTCAAGCAGCGTTAAATTAATGGGTAATGT|AATTTCTCTTTATGAT){s<=5}", s):
    print(match)

My three patterns are found, as expected, using the command above:

<regex.Match object; span=(16, 46), match='CAAGCTTCAGAAGGGATTACTAGCAGTAAA', fuzzy_counts=(3, 0, 0)>
<regex.Match object; span=(54, 70), match='AATTTCTCTTTATGAT'>
<regex.Match object; span=(87, 120), match='GGCTTCAAACAGCGTTAAATTAATGGGTAATGT', fuzzy_counts=(1, 0, 0)>

However if I ask for the best match, the first pattern (starting CAAGCTT) is not found anymore:

for match in regex.finditer("(?b)(CAAGCTTCAGAAGGGATCACTAGCGATAAA|GGCTTCAAGCAGCGTTAAATTAATGGGTAATGT|AATTTCTCTTTATGAT){s<=5}", s):
    print(match)
<regex.Match object; span=(54, 70), match='AATTTCTCTTTATGAT'>
<regex.Match object; span=(87, 120), match='GGCTTCAAACAGCGTTAAATTAATGGGTAATGT', fuzzy_counts=(1, 0, 0)>

How can I explain this, as my patterns are not overlapping and, if I search for them separately, using bestmatch, they are found indeed!?

EDIT

If I modify my string so that the first pattern only has 1 mismatch left (doesn't work with 2), and remove the third pattern from my search, my first pattern will be found along with the second:

s="TACGGTTTTAACTTTGCAAGCTTCAGAAGGGATCACTAGCGGTAAAAATGCGGAAATTTCTCTTTATGATGGCGCCACGCTCAATTTGGCTTCAAACAGCGTTAAATTAATGGGTAATGTCAAG"
for match in regex.finditer("(?b)(CAAGCTTCAGAAGGGATCACTAGCGATAAA|GGCTTCAAGCAGCGTTAAATTAATGGGTAATGT){s<=5}", s):
   print(match)
<regex.Match object; span=(16, 46), match='CAAGCTTCAGAAGGGATCACTAGCGGTAAA', fuzzy_counts=(1, 0, 0)>
<regex.Match object; span=(87, 120), match='GGCTTCAAACAGCGTTAAATTAATGGGTAATGT', fuzzy_counts=(1, 0, 0)>

Is there an unwritten rule of regex provided no more than two best matches, with no more than 1 mismatch?

CodePudding user response:

The (?b) flag in the regular expression indicates that the regex engine should try to find the best match among the patterns provided. When using this flag, the regex engine will choose the match with the fewest number of errors (in this case, mismatches).

In your example, the first pattern CAAGCTTCAGAAGGGATCACTAGCGATAAA has 3 mismatches, while the second and third patterns have 0 and 1 mismatches, respectively. Therefore, when the (?b) flag is used, the regex engine will choose the second and third patterns as the best matches, and the first pattern will be ignored.

If you want to find all the matches for your patterns, regardless of the number of mismatches, you can remove the (?b) flag from the regular expression. This will allow the regex engine to find all the patterns that match with up to 5 mismatches.

CodePudding user response:

I can't replicate your output. When running:

for match in regex.finditer("(CAAGCTTCAGAAGGGATCACTAGCGATAAA|GGCTTCAAGCAGCGTTAAATTAATGGGTAATGT|AATTTCTCTTTATG){s<=5}", s):
    print(match)

I retrieve:

<regex.Match object; span=(16, 46), match='CAAGCTTCAGAAGGGATTACTAGCAGTAAA', fuzzy_counts=(3, 0, 0)>
<regex.Match object; span=(54, 68), match='AATTTCTCTTTATG'>
<regex.Match object; span=(82, 96), match='AATTTGGCTTCAAA', fuzzy_counts=(5, 0, 0)>

Or to have a bit more insight:

enter image description here

If I compare this to your output I can see that the last match has an overlapping position with your 3rd match. Since there is no overlapping the 3rd result in my table above should come first and cancel out your 3rd match.

That being said, if read carefully I think you want to retrieve three matches? The best match for each of your sub-patterns. I don't think a (non)capture group will actually do what you are after. Instead run over a given list of subpatterns:

import regex as re
s="TACGGTTTTAACTTTGCAAGCTTCAGAAGGGATTACTAGCAGTAAAAATGCGGAAATTTCTCTTTATGATGGCGCCACGCTCAATTTGGCTTCAAACAGCGTTAAATTAATGGGTAATGTCAAG"
r_in = ['CAAGCTTCAGAAGGGATCACTAGCGATAAA', 'GGCTTCAAGCAGCGTTAAATTAATGGGTAATGT', 'AATTTCTCTTTATG']
r_new = [mtch[0] if bool(mtch) else '' for mtch in [re.search(fr'(?b)({el}){{s<=5:[ACGT]}}', s) for el in r_in]]
print(r_new)

Gives:

['CAAGCTTCAGAAGGGATTACTAGCAGTAAA', 'GGCTTCAAACAGCGTTAAATTAATGGGTAATGT', 'AATTTCTCTTTATG']

Note that I also edited the 'fuzziness' to be more explicit using {s<=5:[ACGT]}.

  • Related