I'm trying to find URLs in text using RegEx. I found a pattern here. This works well in most cases, but i found a weird test case that causes Catastrophic Backtracking error.
You can see my pattern and a test case here. In this case it works fine but if you add another "!" at the end, it gives you error.
I want to know the cause and how to fix it.
CodePudding user response:
If all you want is to know what to do to make the pattern safer and less catastrophic backtracking prone, you need to replace each (?:x |\(xxx\))*
pattern look like (?:\(xxx\)|x)*
. This greatly reduces the number of steps the regex engine takes.
So, in this case, you can use
(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-] [.][a-z]{2,4}/)(?:\((?:\([^\s()<>] \)|[^\s()<>])*\)|[^\s()<>]) (?:\((?:\([^\s()<>] \)|[^\s()<>])*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))
See the regex demo.
Basically, ([^\s()<>] |(\([^\s()<>] \)))*
is replaced with (?:\((?:\([^\s()<>] \)|[^\s()<>])*\)|[^\s()<>])
that matches
(?:
- start of a non-capturing group:\(
- a(
char(?:\([^\s()<>] \)|[^\s()<>])*
- zero or more occurrences of(
, any one or more chars other than whitespace,(
,)
,<
and>
, and then a)
or a single char other than whitespace,(
,)
,<
and>
\)
- a)
char|
- or[^\s()<>]
- any char other than whitespace,(
,)
,<
and>
)
- one or more occurrences.
CodePudding user response:
If you look above the flag section of your link(https://regex101.com/r/oPnn6n/1). You can see that it took 49,235 steps just to match one string!!
To know what causes backtracking here lets break down the expression:
Your expression can be written as:
(
(?: ~https?://~ | ~www\d{0,3}[.]~ | ~[a-z0-9.\-] [.][a-z]{2,4}/~ ) <-Line1 with 3 conditions
(?: ~[^\s()<>] ~ | ~\(([^\s()<>] |(\([^\s()<>] \)))*\)~ ) <-Line2 with 2 conditions
(?: ~\(([^\s()<>] |(\([^\s()<>] \)))*\)~ | ~[^\s`!()\[\]{};:\'".,<>?«»“”‘’]~ ) <-Line3 with 2 condition
)
Lets focus on the three lines I've indicated above and By conditions I mean the conditions separated by boolean OR |
, which have been put between ~
for readability(this wont with on any regex, its just for readability). So in group1 I'm calling https?://
, www\d{0,3}[.]
and [a-z0-9.\-] [.][a-z]{2,4}/
the three conditions, similarly for the next two lines.
Now lets see how your input: https://test.com/test!!!!!!!!!!!
is being matched by the regex engine.
https://
part is matched(consumed) by condition1 of line1 and since its a boolean OR, the other two conditions are skipped and regex engine moves over to line2.- Whole of
test.com/test!!!!!!!!!!!
is greedily matched(consumed) by condition 1 of line2. Now that all the string has been consumed backtracking starts, so that the regex engine can try matching something with line3. - Engine backtracks one step(uncosumes one
!
) and tries to match the found!
with line3, neither condition on line3 match!
. So the engine goes back to line 2 again. - Here the condition 1 matches the
!
again, and regex engine moves to line 3.
I assume that this repeats till some limit is reached and the engine throws up the error. I could be wrong here though but anyways I hope that you got an idea which part of the expression is causing backtracking.
The best way to avoid backtracking is to be precise about what you are going to match with regex. In your case, one thing that you could to is to simplify/remove repeated matching conditions. Like: https://regex101.com/r/I1u8Ho/1.