Home > OS >  Regex giving infinite loopback instead of ending
Regex giving infinite loopback instead of ending

Time:02-04

Could anyone explain how my MySQL regex script would loop infinitely instead of returning true or false?

'^[[:alnum:]] ([_\\.\\-]?[[:alnum:]] )*@[[:alnum:]] ([_\\.\\-]?[[:alnum:]] )*(\\.[[:alnum:]]{2,4}) $'

Is there a way to make MySQL return false if it detects an infinite loop?

My entire sql query is:

SELECT
    cusworkemail NOT REGEXP '^[[:alnum:]] ([_\\.\\-]?[[:alnum:]] )*@[[:alnum:]] ([_\\.\\-]?[[:alnum:]] )*(\\.[[:alnum:]]{2,4}) $' AS invalid_value,
    cusworkemail,
    num,
    cusid_list
FROM (
    SELECT
        IFNULL(cusworkemail, '') AS cusworkemail,
        count(*) AS num,
        GROUP_CONCAT(DISTINCT cusid) AS cusid_list
    FROM (SELECT cusid, cusworkemail FROM dealCRM.cus WHERE cusworkemail != '' AND cusworkemail IS NOT NULL) AS t
    GROUP BY cusworkemail
    ORDER BY num DESC
    -- LIMIT 0, 10000
    ) AS c
HAVING invalid_value;

The query will throw an error on regex timeout.

Here is an example of an email that will cause an infinite loop:

"[email protected] " with a space at the end.

Does the parser not detect that it is repeating the same internal state?

CodePudding user response:

The pattern [[:alnum:]] ([_\\.\\-]?[[:alnum:]] )* is potentially causing a lot of backtracking. It is not an infinite loop, just a lot of looping which will hit a limit.

Take for instance this string which should not match: "abcdefg"

It will first match [[:alnum:]] against "abcdefg", then execute ()* zero times and then find that there is no "@". So it backtracks making [[:alnum:]] to match one fewer character: "abcdef". The ()* part can now execute once. But it needs to backtrack again. We can summarise this as:

abcdefg()
abcdef(g)
abcde(fg)
abcd(efg)
abc(defg)
ab(cdefg)
a(bcdefg)
 

None of this divisions of the characters into the first or second [[:alnum:]] pattern helps to get past that @. In the example you have given this backtracking will occur at the end so that there is a Cartesian product of backtracking in the part before the @ and in the part after it.

Conclusion: you should remove that ?. It is not an optional match. The optional part is already reflected by the surrounding *. Either there is a hyphen/dot/underscore and the pattern in parentheses should be executed, or there is not, and then that pattern should not be executed.

Note you have this pattern twice, so two corrections are needed.

  • Related