Regular expression performance drop when using "caret"

I am trying to extract a section of text from a filing submission text file made available by the SEC.

I have noticed that, of the following two regex patterns, which are otherwise identical, the one starting with the caret (^) takes approximately twice as long as the one without (the presence of the $ for EOL doesn't seem to affect performance):

re.compile(r"<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>", re.M | re.S)
re.compile(r"^<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>", re.M | re.S)

Given that the "<FILENAME>" string always occurs at the start of a line in the text files, I assumed, perhaps naively, that the caret would improve the regex performance rather than diminish it. Is there something more fundamentally wrong with my regex, or is there something here I just don't understand? I'm on a Windows machine, in case that could be affecting EOL.

In the case of the example filing above the section of interest begins with "<FILENAME>aapl-20210925_htm.xml".

Here is the function I'm using:

import re
from time import perf_counter

def extract_xml(textbody, main_pattern, backup_pattern):
    start = perf_counter()
    if xml := main_pattern.search(textbody):
        stop = perf_counter()
        return xml.group(1), stop - start
    elif xml := backup_pattern.search(textbody):
        stop = perf_counter()
        return xml.group(1), stop - start
    else:
        raise Exception("No pattern matched.")

Here main_pattern is one of the two regexes above.

Thank you in advance for taking the time to read this through!

User response:

I see the same difference as the OP. If you compile both regular expressions with the re.DEBUG flag added, the debug output for the first regex (without ^) begins with:

LITERAL 60
LITERAL 70
LITERAL 73
LITERAL 76
LITERAL 69
LITERAL 78
LITERAL 65
LITERAL 77
LITERAL 69
LITERAL 62
MIN_REPEAT 0 MAXREPEAT
  IN
    CATEGORY CATEGORY_WORD
    LITERAL 45

And the debug output for the regex with the ^ begins with:

AT AT_BEGINNING
LITERAL 60
LITERAL 70
LITERAL 73
LITERAL 76
LITERAL 69
LITERAL 78
LITERAL 65
LITERAL 77
LITERAL 69
LITERAL 62
MIN_REPEAT 0 MAXREPEAT
  IN
    CATEGORY CATEGORY_WORD
    LITERAL 45
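
To reproduce the two dumps above, a minimal sketch along these lines should work. It assumes that re.DEBUG writes its output to standard output at compile time; debug_dump is a hypothetical helper name, not part of the re module, and the exact shape of the dump can vary between Python versions:

```python
import contextlib
import io
import re


def debug_dump(pattern, flags=re.M | re.S):
    """Return whatever re.DEBUG prints while compiling `pattern` (hypothetical helper)."""
    buffer = io.StringIO()
    # re.DEBUG prints to sys.stdout, so capture it for side-by-side comparison.
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, flags | re.DEBUG)
    return buffer.getvalue()


plain_dump = debug_dump(r"<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>")
caret_dump = debug_dump(r"^<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>")

# Show only the first few lines of each dump.
print("\n".join(plain_dump.splitlines()[:3]))
print("\n".join(caret_dump.splitlines()[:3]))
```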

The difference is the AT AT_BEGINNING directive inserted at the start of the second regex. Unfortunately, a cursory search did not reveal how this directive is actually implemented, nor do I see an obvious reason why it runs so much more slowly. But I can conjecture the reason as follows:

For the first regex, the regular expression engine initially does an efficient scan looking for the first instance of '<FILENAME>' and then proceeds to match the rest of the regex. If this failed to match, it would resume by looking for the next occurrence of '<FILENAME>', but in our case the match does succeed.

For the second regex, I am guessing that the engine explicitly looks for '<FILENAME>' at the start of the string and after each newline. This fails many times before it finally succeeds, and only then can the engine match the rest of the regular expression. My suspicion is that the extra time goes into these repeated searches for a newline character, each followed by a separate attempt to match '<FILENAME>'. Since we know that in this particular input '<FILENAME>' does not appear at the start of the string, we can fold the two steps into one by searching for r'\n<FILENAME>' directly. So perhaps I can offer an alternative regex that does just that.
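
As a small illustration of that idea, the sketch below checks on a made-up miniature input (the sample string is hypothetical, not the actual SEC filing) that the caret version and the r'\n<FILENAME>' version capture the same section, with the newline version's match simply starting one character earlier:

```python
import re

# Hypothetical miniature of a filing; the real input is far larger.
sample = (
    "<TYPE>EX-101\n"
    "<FILENAME>aapl-20210925_htm.xml\n"
    "<XML>\npayload\n</XML>\n"
    "</DOCUMENT>\n"
)

flags = re.M | re.S
caret = re.compile(r"^<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>", flags)
newline = re.compile(r"\n<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>", flags)

m_caret = caret.search(sample)
m_newline = newline.search(sample)

# Both patterns capture the same text in group 1.
print(m_caret.group(1) == m_newline.group(1))
# The newline version consumes the preceding "\n", so its match starts one character earlier.
print(m_caret.start() - m_newline.start())
```

Note that the newline version would miss a '<FILENAME>' tag sitting at the very start of the string, which is the trade-off described above.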

Since with the re.M flag the caret (^) should match either at the start of the string (\A) or at the start of a line, i.e. right after a newline, I would think that the following is more-or-less equivalent:

(?:\A|\n)<FILENAME>[\w-]*?htm\.xml$(\n.*?)</DOCUMENT>

This says that '<FILENAME>' must be either at the start of the string or preceded by a newline. But this resulted in even worse performance. I then dropped the possibility that what we are looking for could be at the start of the string, since for this input we know that it isn't:

\n<FILENAME>[\w-]*?htm\.xml$(\n.*?)</DOCUMENT>

This out-performed even the original regex. Here is my benchmark:

import re
import timeit

def extract_xml(textbody, main_pattern):
    if xml := main_pattern.search(textbody):
        return xml.group(1)
    else:
        raise Exception("No pattern matched.")


with open('test.txt', 'r') as f:
    s = f.read()

rex1 = re.compile(r"<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>", re.M | re.S)

rex2 = re.compile(r"^<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>", re.M | re.S)
# Equivalent (?) to above rex2:
rex2a = re.compile(r"(?:\A|\n)<FILENAME>[\w-]*?htm\.xml$(\n.*?)</DOCUMENT>", re.M | re.S)
# Relaxing condition that what we are looking for can be at start of the string:
rex2b = re.compile(r"\n<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>", re.M | re.S)


print(timeit.timeit(stmt='globals()["s1"] = extract_xml(s, rex1)', number=50, globals=globals()))
print(timeit.timeit(stmt='globals()["s2"] = extract_xml(s, rex2)', number=50, globals=globals()))
print(timeit.timeit(stmt='globals()["s2a"]  = extract_xml(s, rex2a)', number=50, globals=globals()))
print(timeit.timeit(stmt='globals()["s2b"] = extract_xml(s, rex2b)', number=50, globals=globals()))

print(s1 == s2 and s2 == s2a and s2a == s2b)

Prints:

1.6319537999999998
3.7008588000000002
6.279504599999999
1.545982799999999
True
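
For anyone without the filing at hand, here is a rough way to rerun the comparison on synthetic input. This is only a sketch: the filler lines and the results and patterns names are assumptions made for illustration, and timings on such input may not reproduce the ratios above, which came from the real file:

```python
import re
import timeit

# Hypothetical stand-in for the filing: many filler lines, then the target section.
filler = "".join(f"<LINE>{i} some filler text\n" for i in range(20000))
text = filler + "<FILENAME>aapl-20210925_htm.xml\n<XML>\npayload\n</XML>\n</DOCUMENT>\n"

flags = re.M | re.S
patterns = {
    "plain": re.compile(r"<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>", flags),
    "caret": re.compile(r"^<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>", flags),
    "newline": re.compile(r"\n<FILENAME>[\w-]*?htm\.xml$(.*?)</DOCUMENT>", flags),
}

# Check first that all three variants extract the same section.
results = {name: rex.search(text).group(1) for name, rex in patterns.items()}

# Then time each variant; the numbers depend on machine and Python version.
for name, rex in patterns.items():
    seconds = timeit.timeit(lambda: rex.search(text), number=20)
    print(f"{name:8s} {seconds:.4f}")
```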