Catastrophic backtracking issue with regular expression on long names

Time:04-11

I am trying to validate a list of names (among other things) with a regex. It works so far, except when I test the limits. When the name is long (a maximum of 128 characters is allowed) and ends with a character that belongs to the inner separator group, such as a space or a punctuation mark, catastrophic backtracking occurs. I don't quite understand why, because I would assume the following: the first group (?:[\p{L}\p{Nd}\p{Ps}]) must occur at least once, the separator group (?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}])? is optional, and, whenever it is present, it must be followed by the closing group (?:[\p{L}\p{Nd}\p{Pe}.]). The last two groups can repeat.

Full pattern

^(?!.{129})(?!.["])(?:[\p{L}\p{Nd}\p{Ps}])+(?:(?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}])?(?:[\p{L}\p{Nd}\p{Pe}.]))*$

Tests & Samples

https://regex101.com/r/6E0Khd/1

Answer:

You need to rephrase the pattern so that consecutive regex parts cannot match at the same location inside the string.

You can use

^(?!.{129})(?!.")[\p{L}\p{Nd}\p{Ps}][\p{L}\p{Nd}\p{Pe}.]*(?:(?:\p{Zs}\p{P}?|\p{P}\p{Zs}?)[\p{L}\p{Nd}\p{Pe}.]+)*$

See the regex demo.

Your regex was ^<Lookahead_1><Lookahead_2><P_I>+(?:<OPT_SEP>?<P_II>)*$. You need to make sure the string only starts with a char that matches the <P_I> pattern, and the rest of the chars can match the <P_II> pattern. So, it should look like ^<Lookahead_1><Lookahead_2><P_I><P_II>*(?:<SEP><P_II>+)*$. Note that the P_I pattern now matches the first char only, the P_II pattern is added right after P_I to match zero or more chars, the SEP pattern is now obligatory, and the P_II pattern inside the group is quantified with +.
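The restructured layout can be tried out with a small Python sketch. Python's built-in re module does not support \p{...} classes, so the character sets below are ASCII stand-ins chosen for illustration only (an assumption, not the Unicode categories from the pattern), and the helper name is_valid_name is hypothetical.

```python
import re

# ASCII stand-ins (assumption): the built-in re module has no \p{...} support,
# so each Unicode category is approximated by a small ASCII set.
P_I = r"[A-Za-z0-9(\[{]"      # stands in for [\p{L}\p{Nd}\p{Ps}]
P_II = r"[A-Za-z0-9)\]}.]"    # stands in for [\p{L}\p{Nd}\p{Pe}.]
PUNCT = "[" + re.escape("!\"#%&'()*,-./:;?@[\\]_{}") + "]"  # stands in for \p{P}
SEP = "(?: " + PUNCT + "?|" + PUNCT + " ?)"  # obligatory separator, " " for \p{Zs}

# ^<Lookahead_1><Lookahead_2><P_I><P_II>*(?:<SEP><P_II>+)*$
NAME = re.compile(
    r'^(?!.{129})(?!.")' + P_I + P_II + "*(?:" + SEP + P_II + "+)*$"
)


def is_valid_name(name: str) -> bool:
    """Return True if the name matches the restructured pattern (hypothetical helper)."""
    return NAME.search(name) is not None


if __name__ == "__main__":
    for sample in ["John Smith", "Smith, John", "J. R. R. Tolkien",
                   "John Smith ", "John  Smith", " John", "A" * 129]:
        print(repr(sample[:20]), is_valid_name(sample))
```

With these stand-ins, a name may contain single separators between runs of name characters, while a trailing separator, a doubled space, or a 129-character input is rejected.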

I also shrunk the (?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}]) pattern into (?:\p{Zs}\p{P}?|\p{P}\p{Zs}?): it matches either a horizontal whitespace char followed by an optional punctuation char, or a punctuation char followed by an optional horizontal whitespace char.
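That the shorter alternation accepts the same separators as the longer one can be checked by brute force. The following is a minimal sketch that assumes a tiny stand-in alphabet (a space for \p{Zs}, "." and "," for \p{P}, "a" as a non-separator) and a hypothetical helper name, mismatches.

```python
import itertools
import re

# Stand-ins (assumption): " " plays the role of \p{Zs}, "." and "," of \p{P}.
LONG = re.compile(r"(?: [.,]|[.,] |[., ])")   # (?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}])
SHORT = re.compile(r"(?: [.,]?|[.,] ?)")      # (?:\p{Zs}\p{P}?|\p{P}\p{Zs}?)


def mismatches(alphabet: str = " .,a", max_len: int = 3) -> list:
    """Return every string up to max_len that only one of the two patterns fully matches."""
    found = []
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if (LONG.fullmatch(s) is None) != (SHORT.fullmatch(s) is None):
                found.append(s)
    return found


if __name__ == "__main__":
    print(mismatches())  # an empty list means both forms accept the same strings
```

Over this small alphabet both forms accept exactly the one- and two-character separators, so the list of mismatches is empty.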

Note that \p{Zs} does not match a TAB char; you may want to use [\p{Zs}\t] instead.
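The category difference is easy to confirm with Python's standard unicodedata module. This is a small sketch; the helper is_name_space and its allow_tab parameter are hypothetical names that mimic the choice between \p{Zs} and [\p{Zs}\t].

```python
import unicodedata


def is_name_space(ch: str, allow_tab: bool = False) -> bool:
    """Mimic \\p{Zs}, or [\\p{Zs}\\t] when allow_tab is True (hypothetical helper)."""
    if allow_tab and ch == "\t":
        return True
    return unicodedata.category(ch) == "Zs"


if __name__ == "__main__":
    for label, ch in [("SPACE", " "), ("NO-BREAK SPACE", "\u00a0"), ("TAB", "\t")]:
        # SPACE and NO-BREAK SPACE are in category Zs; TAB is a control char (Cc).
        print(label, unicodedata.category(ch), is_name_space(ch))
```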
