Reversed regex pattern much slower than original

Time: 05-19

I have a rather complex regex search in place, and I recently had to extend it to a "reversed" pattern. This was easy to implement, but performance dropped by roughly a factor of 10.

I would appreciate any tips on how to mitigate this, and I'm not particular about how: it's fine, for example, to use two or more steps if that is faster than a single, very involved one.

Original pattern

import regex

expression = regex.compile(
    fr"""
(?:{CUE})  # cue before targets. non capture (?:)
(?:{PADDING})  # text before the match.
(?:
(?:{ITEM_SEPARATION})?  # none or once
(?P<targets>{targets_as_regex})
) 
""",
    regex.VERBOSE,
)

Need to know: CUE is a rather short list of options, something like:

CUE = r"""word
|two\swords
|simple\soptions?
"""

This is about 5-15 options, quite simple (a bit simpler for the reversed scenario).

PADDING is literally anything, but no longer than 150 characters (PADDING = r".{0,150}?", with a lazy quantifier). In the reversed scenario it is a bit more restrictive: only [,A-Za-z\d\s].

ITEM_SEPARATION is simple: optional spaces, followed by either a comma or a word-bounded "y", then more optional spaces:

ITEM_SEPARATION = r"""
\s*          # optional spaces
(?:,|\by\b)  # non-capture, either comma, or 'y' (with bounds) 
\s*          # optional spaces
"""
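In isolation, this component can be sanity-checked like so (a sketch using the stdlib re module; the question uses the third-party regex module, but nothing here depends on it):

```python
import re

# ITEM_SEPARATION exactly as defined above, compiled on its own
ITEM_SEPARATION = r"""
\s*          # optional spaces
(?:,|\by\b)  # non-capture, either comma, or 'y' (with bounds)
\s*          # optional spaces
"""
sep = re.compile(ITEM_SEPARATION, re.VERBOSE)

sep.fullmatch(", ")    # matches: comma separator
sep.fullmatch(" y ")   # matches: bounded 'y' separator
sep.fullmatch(" and ") # no match
```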

Finally, the problem is likely in targets_as_regex, which is a big list of bounded names. This can easily be hundreds or thousands of options, although each call will pass a unique list of targets.
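Such an alternation might be assembled along these lines (a sketch with invented names; sorting longest-first prevents a shorter name from shadowing a longer one that starts the same way):

```python
import re

# Invented example names; the real lists hold hundreds or thousands of them.
names = ["final match", "a match", "another match"]

# Escape each name, wrap it in word boundaries ("bounded names"),
# and try longer alternatives first.
targets_as_regex = "|".join(
    rf"\b{re.escape(name)}\b"
    for name in sorted(names, key=len, reverse=True)
)

m = re.search(targets_as_regex, "text with another match inside")
```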

The idea is to match something like:

This is a text, with a cue, then a bit more text and a match, other match y final match. And after the "." we stop matching.

And it performs sufficiently well. But then we have the reversed situation:

Reversed pattern

The idea is to match this:

This not a match, because of the stop. This is a match, another match y final match, because now we find a cue. Etc.

I naturally recycled my pattern:

expression = regex.compile(
    fr"""
(
(?P<targets>{targets_as_regex})
(?:{ITEM_SEPARATION})?  
) 
(?:{SIMPLE_PADDING})  # text before the match
(?:{CUE_AFTER})  # cue after targets
""",
    regex.VERBOSE,
)

It works as expected, but performance is very poor when targets (that are many) come before cues (which are few).

What I tried

I added a basic check that skips the whole thing if the cue does not appear in the text at all, but it did not help much. Another idea I tested was to first search for all target matches (ignoring cues entirely) and then pass a shortened alternation (targets_as_regex) containing only those matches to the slow pattern. This actually helps, but not by much (still close to that x10 drop in performance).
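That second workaround could be sketched like this (prefilter_targets and the sample data are invented for illustration):

```python
import re

def prefilter_targets(text, targets):
    # Cheap literal pre-pass: keep only the targets that occur in the text at all.
    return [t for t in targets if t in text]

targets = ["a match", "another match", "final match", "unrelated name"]
text = "This is a match, another match y final match, because now we find a cue."

present = prefilter_targets(text, targets)
# Rebuild the (much shorter) alternation from the survivors only.
shortened_targets_as_regex = "|".join(
    rf"\b{re.escape(t)}\b" for t in sorted(present, key=len, reverse=True)
)
```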

Any other ideas that could help?

CodePudding user response:

The ((?P<targets>{targets_as_regex})(?:{ITEM_SEPARATION})?) part is a well-known catastrophic-backtracking-prone pattern (analogous to (a+b?)+, which reduces to (a+)+), and the further to the left of the pattern it appears, the more dangerous it is.

You need to "unroll" patterns like (a+b?)+ into a+(?:b?a+)*, making the subsequent subpatterns match only at different positions inside the string.

The pattern you need is

fr"""(?:{targets_as_regex})(?:{ITEM_SEPARATION}(?:{targets_as_regex}))*(?:{SIMPLE_PADDING})(?:{CUE_AFTER})"""
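To make the unrolling concrete, here is a self-contained sketch with toy stand-ins for the building blocks (all values below are invented; the stdlib re module is used, but the pattern behaves the same under regex):

```python
import re

# Toy stand-ins for the question's building blocks (assumptions, not the real data)
targets_as_regex = r"a\smatch|another\smatch|final\smatch"
ITEM_SEPARATION = r"\s*(?:,|\by\b)\s*"
SIMPLE_PADDING = r"[,A-Za-z\d\s]{0,150}?"
CUE_AFTER = r"\bcue\b"

# Unrolled: one target, then (separator + target) repeated, with no optional
# group left for the engine to re-bracket while backtracking.
unrolled = re.compile(
    fr"(?P<targets>(?:{targets_as_regex})"
    fr"(?:{ITEM_SEPARATION}(?:{targets_as_regex}))*)"
    fr"(?:{SIMPLE_PADDING})(?:{CUE_AFTER})"
)

text = "This is a match, another match y final match, because now we find a cue."
m = unrolled.search(text)
```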

To speed it up, you need a regex trie built from the targets_as_regex, see Speed up millions of regex replacements in Python 3.
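The core of that idea, sketched minimally (this is not the code from the linked answer; ready-made implementations also exist on PyPI): factor common prefixes out of the alternation, so the engine never re-tries hundreds of branches that start the same way.

```python
import re

def trie_regex(words):
    """Build a prefix-factored alternation from literal words
    (a sketch of the 'regex trie' idea)."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True  # end-of-word marker

    def to_pattern(node):
        end = "" in node
        branches = [
            re.escape(ch) + to_pattern(child)
            for ch, child in sorted(node.items())
            if ch != ""
        ]
        if not branches:           # leaf: only the end marker
            return ""
        if len(branches) == 1 and not end:
            return branches[0]     # single child, no shorter word ends here
        pattern = "(?:" + "|".join(branches) + ")"
        return pattern + "?" if end else pattern

    return to_pattern(trie)
```

The result can then be dropped in place of the plain alternation, e.g. rf"\b(?:{trie_regex(names)})\b".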
