import re
input_text_to_check = str(input()) #Input
regex_patron_m1 = r"\s*((?:\w \s*) ) \s*\¿?(?:would not be what |would not be that |would not be that |would not be the |would not be this |would not be the |would not be some)\s*((?:\w \s*) )\s*\??"
m1 = re.search(regex_patron_m1, input_text_to_check, re.IGNORECASE) #Con esto valido la regex haber si entra o no en el bloque de code
#Validation
if m1:
word, association = m1.groups()
word = word.strip()
association = association.strip()
print(repr(word))
print(repr(association))
I think that although the regex is somewhat long, for a modern PC it should not be too much work to validate that 10 or 20 options in the (?: | | | | )
That's why I thought that the problem could be in first \s*((?:\w \s*) ) \s*
and/or in the last \s*((?:\w \s*) )\s*
The following is an example of an input that causes the regular expression got stuck:
"the blue skate would not be that product that you want buy now"
And this is an example where it doesn't crash:
"the blue skate would not be that product"
And give me the words that I want extract:
'the blue skate'
'product'
Is there an alternative to be able to extract what is in front of and behind those options? and that it does not crash sometimes? what could be the reason of the problem with this regex that I made?
CodePudding user response:
Based on this explenation of 'Catastrophic Backtracking' I think the issue with your regex is the following:
The thing you try to match with ((?:\w \s*) )
can be matched in multiple ways. Let's say you use ((?:\w \s*) )
on the input string abc
.
This can be matched in many ways:
- (
a
and0
whitespaces)(b
and0
whitespaces)(c
and0
whitespaces) - (
a
and0
whitespaces)(bc
and0
whitespaces) - (
ab
and0
whitespaces)(c
and0
whitespaces)
As long as you only need to match ((?:\w \s*) )
this works fine. But when you add something else afterwards (like the 10 or so options in your case) regex needs to do some heavy recusion. Have a look at the provided link for a better explanation.
Removing the
after both the \w
results in a working regex for the two cases provided:
"\s*((?:\w\s*) ) \s*\¿?(?:would not be what |would not be that |would not be that |would not be the |would not be this |would not be the |would not be some)\s*((?:\w\s*) )\s*\??"gm
Does this work on your device and for all your test cases?