Home > Blockchain >  Why does this regular expression sometimes get stuck and freeze my program? what alternative could i
Why does this regular expression sometimes get stuck and freeze my program? what alternative could i

Time:02-16

import re

input_text_to_check = str(input()) #Input

regex_patron_m1 = r"\s*((?:\w \s*) ) \s*\¿?(?:would not be what |would not be that |would not be that |would not be the |would not be this |would not be the |would not be some)\s*((?:\w \s*) )\s*\??"
m1 = re.search(regex_patron_m1, input_text_to_check, re.IGNORECASE) #Con esto valido la regex haber si entra o no en el bloque de code

#Validation
if m1:
    word, association = m1.groups()
    word = word.strip()
    association = association.strip()

    print(repr(word))
    print(repr(association))

I think that although the regex is somewhat long, for a modern PC it should not be too much work to validate that 10 or 20 options in the (?: | | | | ) That's why I thought that the problem could be in first \s*((?:\w \s*) ) \s* and/or in the last \s*((?:\w \s*) )\s*

The following is an example of an input that causes the regular expression got stuck:

"the blue skate would not be that product that you want buy now"

And this is an example where it doesn't crash: "the blue skate would not be that product"

And give me the words that I want extract:

'the blue skate'
'product'

Is there an alternative to be able to extract what is in front of and behind those options? and that it does not crash sometimes? what could be the reason of the problem with this regex that I made?

CodePudding user response:

Based on this explenation of 'Catastrophic Backtracking' I think the issue with your regex is the following:

The thing you try to match with ((?:\w \s*) ) can be matched in multiple ways. Let's say you use ((?:\w \s*) ) on the input string abc. This can be matched in many ways:

  • (a and 0 whitespaces)(b and 0 whitespaces)(c and 0 whitespaces)
  • (a and 0 whitespaces)(bc and 0 whitespaces)
  • (ab and 0 whitespaces)(c and 0 whitespaces)

As long as you only need to match ((?:\w \s*) ) this works fine. But when you add something else afterwards (like the 10 or so options in your case) regex needs to do some heavy recusion. Have a look at the provided link for a better explanation.

Removing the after both the \w results in a working regex for the two cases provided:


"\s*((?:\w\s*) ) \s*\¿?(?:would not be what |would not be that |would not be that |would not be the |would not be this |would not be the |would not be some)\s*((?:\w\s*) )\s*\??"gm

Does this work on your device and for all your test cases?

  • Related