Regex explanation of sphinx gallery

Time:04-05

I am debugging sphinx gallery tooltip generation, which involves the following code:

def extract_intro_and_title(filename, docstring):
    """Extract and clean the first paragraph of module-level docstring."""
    # lstrip is just in case docstring has a '\n\n' at the beginning
    paragraphs = docstring.lstrip().split('\n\n')
    # remove comments and other syntax like `.. _link:`
    paragraphs = [p for p in paragraphs
                  if not p.startswith('.. ') and len(p) > 0]
    if len(paragraphs) == 0:
        raise ExtensionError(
            "Example docstring should have a header for the example title. "
            "Please check the example file:\n {}\n".format(filename))
    # Title is the first paragraph with any ReSTructuredText title chars
    # removed, i.e. lines that consist of (3 or more of the same) 7-bit
    # non-ASCII chars.
    # This conditional is not perfect but should hopefully be good enough.
    title_paragraph = paragraphs[0]
    match = re.search(r'^(?!([\W _])\1{3,})(.+)', title_paragraph,
                      re.MULTILINE)

    if match is None:
        raise ExtensionError(
            'Could not find a title in first paragraph:\n{}'.format(
                title_paragraph))
    title = match.group(0).strip()
    # Use the title if no other paragraphs are provided
    intro_paragraph = title if len(paragraphs) < 2 else paragraphs[1]
    # Concatenate all lines of the first paragraph and truncate at 95 chars
    intro = re.sub('\n', ' ', intro_paragraph)
    intro = _sanitize_rst(intro)
    if len(intro) > 95:
        intro = intro[:95] + '...'
    return intro, title

The line which I do not understand is:

match = re.search(r'^(?!([\W _])\1{3,})(.+)', title_paragraph,
                  re.MULTILINE)

Can someone explain it to me please?

CodePudding user response:

To start:

>>> import re
>>> help(re.search)
Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.
(END)

That tells us that re.search takes a pattern, a string, and optional flags that default to 0.

That probably doesn't help much on its own.

The flag being passed is re.MULTILINE. That tells the regular expression engine to treat ^ and $ as the beginning and end of each line. By default, they match only at the beginning and end of the whole string, regardless of how many lines make up the string.
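A small sketch of the difference the flag makes. The sample string and the simple pattern here are made up for illustration and are not taken from sphinx gallery:

```python
import re

# Hypothetical two-line input, only for illustration.
text = "first line\nsecond line"

# Without re.MULTILINE, ^ matches only at the very start of the string.
without_flag = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also matches immediately after each newline.
with_flag = re.findall(r'^\w+', text, re.MULTILINE)

print(without_flag)  # ['first']
print(with_flag)     # ['first', 'second']
```

This is why the sphinx gallery code can test every line of the title paragraph with a single re.search call.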

The pattern that's being matched is looking for the following:

^ - the match must start at the beginning of a line (any line, because of re.MULTILINE)

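(?!([\W _])\1{3,}) - the line can't start with four or more copies of the same character, where that character is a non-word character (\W), a space ( ) or an underscore (_). This is a negative lookahead ((?! ... )) containing a character class in parentheses (([\W _])), which makes it capture group 1. Whatever that group captured then has to repeat 3 or more times (\1{3,}): \1 refers to the contents of capture group 1, and {3,} means at least 3 times. The first character plus at least 3 repeats of it means the lookahead rejects lines that begin with a run of 4 or more identical characters such as ==== or ----. Four different punctuation characters in a row are not rejected, because \1 must match the same character that was captured. The space in the class is redundant, since \W already matches a space. A lookahead doesn't consume any characters; it only decides whether the match may continue at this position.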

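As a side note, \W matches the opposite of \w. With the re.ASCII flag, \w is shorthand for [A-Za-z0-9_], so \W is shorthand for [^A-Za-z0-9_]. For str patterns in Python 3 without that flag, \w also matches Unicode letters and digits, so \W excludes those as well.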

(.+) - If the lookahead allowed the match to continue and the line contains 1 or more characters, the entire line is matched into capture group 2. Because . does not match a newline by default, the match stops at the end of that line.
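Putting the pieces together, here is a minimal sketch that applies the same pattern to a few made-up title paragraphs. The helper name find_title and the sample strings are hypothetical, chosen only to show which lines the lookahead skips:

```python
import re

# Same pattern as in extract_intro_and_title, compiled once with MULTILINE.
TITLE_RE = re.compile(r'^(?!([\W _])\1{3,})(.+)', re.MULTILINE)


def find_title(paragraph):
    """Return the first line that is not an RST-style underline, or None."""
    match = TITLE_RE.search(paragraph)
    return None if match is None else match.group(0).strip()


# Overline and underline are skipped; the text line in between is the title.
print(find_title("=====\nMy example\n====="))  # My example

# Title followed by an underline: the first line already qualifies.
print(find_title("My example\n----------"))    # My example

# Only underline-style lines: no match, so the real code raises ExtensionError.
print(find_title("=====\n-----"))              # None

# Three repeated characters are not enough to be rejected (four are needed).
print(find_title("=== still a title"))         # === still a title
```

Note that the pattern needs a run of four identical characters before it treats a line as a title marker, which is slightly stricter than the "3 or more" wording in the code comment.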

See https://regex101.com/r/3p73lf/1 to explore the behavior of the regular expression.
