Home > Back-end >  Why are sequential regular expressions more efficient than a combined experession?
Why are sequential regular expressions more efficient than a combined experession?

Time:07-27

In answering a Splunk question on SO, the following sample text was given:

msg: abc.asia - [2021-08-23T00:27:08.152 0000] "GET /facts?factType=COMMERCIAL&sourceSystem=ADMIN&sourceOwner=ABC&filters=%7B%22stringMatchFilters%22:%5B%7B%22key%22:%22BFEESCE((json_data-%3E%3E'isNotSearchable')::boolean,%20false)%22,%22value%22:%22false%22,%22operator%22:%22EQ%22%7D%5D,%22multiStringMatchFilters%22:%5B%7B%22key%22:%22json_data-%3E%3E'id'%22,%22values%22:%5B%224970111%22%5D%7D%5D,%22containmentFilters%22:%5B%5D,%22nestedMultiStringMatchFilter%22:%5B%5D,%22nestedStringMatchFilters%22:%5B%5D%7D&sorts=%7B%22sortOrders%22:%5B%7B%22key%22:%22id%22,%22order%22:%22DESC%22%7D%5D%7D&pagination=null

The person wanted to extract everything in the filters portion of the URL if factType was COMMERCIAL

The following all-in-one regex pulls it out neatly (presuming the URL is always in the right order (ie factType coming before filters):

factType=(?<facttype>\w ). filters=(?<filters>[^\&] )

According to regex101, it finds its expected matches with 670 steps

But if I break it up to

factType=(?<facttype>\w )

followed by

filters=(?<filters>[^\&] )

regex101 reports the matches being found with 26 and 16 steps, respectively

What about breaking up the regex into two makes it so much more (~15x) efficient to match?

CodePudding user response:

The main problem with the regexp is the presence of the . where . eat (nearly) anything and * is generally greedy. Indeed, regexp engines are split in two categories: greedy engines and lazy ones. Greedy engines basically consume all the characters and backtrack as long as nothing is found while lazy ones consume characters only when the following pattern is not found. More engines are greedy. AFAIK, Java is the rare language to use a lazy engine by default. Fortunately, you can specify that you want a lazy quantifier with . ?. This means the engine will try to search the shortest possible match for .* instead of the longest one by default. This is what people usually do when writing a manual search. The result is 65 steps instead of 670 steps (10x better).

Note, that quantifiers do not always help in such a case. It is often better to make the regexp more precise (ie. deterministic) so to improve performance (by reducing the number of possible backtracks due to wrongly path tacking in the non-deterministic automaton).

Still, note that regexp engines are generally not very optimized compared to manual searches (as long as you use efficient search algorithms). They are great to make the code short, flexible and maintainable. For high-performance, a basic loop in native languages is often better. This is especially true if it is vectorized using SIMD instructions (which is generally not easy).

CodePudding user response:

Here is a regex that would be inherently more efficient than . or . ? irrespective of the positions of those matches in input text.

factType=(?<facttype>\w )(?:&(?!filters=)[^&\s]*)*&filters=(?<filters>[^\&] )

This regex may look bit longer but it will be more efficient because we are using a negative lookahead (?!filters=) after matching & to stop the match just before filters query parameter.

Q. What is backtracking?
A. In simple words: If a match isn't complete, the engine will backtrack the string to try to find a whole match until it succeeds or fails. In the above example if you use . it matches longest possible match till the end of input then starts backtracking one position backward at a time to find the whole match of second pattern. When you use . ? it just does lazy match and moves forward one position at a time to get the full match.

This suggested approach is far more efficient than .* or . or . ? approaches because it avoids expensive backtracking while trying to find match of the second pattern.

RegEx Details:

  • factType=: Match factType=
  • (?<facttype>\w ): Match 1 word characters and capture in named group facttype
  • (?:: Start non-capture group
    • &: Match a &
    • (?!filters=): Stop matching when we have filters= at next position
    • [^&\s]*: Match 0 or more of non-space non-& chars
  • )*: End non-capture group. Repeat this group 0 or more times
  • &: Match a &
  • filters=: Match filters=
  • (?<filters>[^\&] ): Match 1 or more of non-space non-& chars and capture in named group filters

Related article on catastrophic backtracking

  • Related