How to not match substrings with regex

Time:01-25

String to match:

{abc}

Strings to not match:

$${abc{abc}{abc}}$$

How do I satisfy this requirement with regex?
The context: I am trying to match {abc} elements for replacement with Python, but I don't want them mixed up with MathJax equations such as $${abc{abc}{abc}}$$ in an HTML file.

I understand that [^$]{.+} works to some extent for strings such as "$${}$$", but not for strings with nested brackets that have content inside them, such as "$${{abc}}$$", which shouldn't be matched.

Sample:

Match:
{element 1}
{element 2}
{element_abc}

Don't match:
$${abc{abc}{abc}}$$
$${mathjax{}{mathjax}}$$
$${mathjax{}{}{}{}{{{mathjax}}}}$$

With matches of:

  • {element 1}
  • {element 2}
  • {element_abc}

The search doesn't need to scan recursively for intermixed elements:

  • {$${}$$} can match (not possible in my actual text, so a match can be made if necessary)
  • A line containing both {abc} and $${abc}$$, such as {abc} abc $${abc}$$, may be possible

Using regex 2021.11.10 via pip

CodePudding user response:

If you simply need to match the expressions on individual lines, all you need is to add line anchors.

^\{[^{}]+\}$

If your input is a single string with multiple lines in it, you'll need to add the re.MULTILINE flag to say that ^ and $ should match at internal newlines, too.

>>> import re
>>> re.findall(r'^\{[^{}]+\}$', '''
... {foo}
... $${bar{baz}}
... {quux}
... ick
... ''', re.MULTILINE)
['{foo}', '{quux}']

This is portable back to the standard Python re module, too.
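Since the question is about replacement rather than just finding matches, here is a minimal sketch of how the anchored pattern could be used with re.sub from the standard library. The helper name replace_line_elements and the values mapping are hypothetical, introduced only for illustration; a capture group is added around the element name so it can be looked up.

```python
import re

# Anchored pattern from the answer above, with a capture group added
# around the element name (the group is an addition for this sketch).
ELEMENT_LINE = re.compile(r'^\{([^{}]+)\}$', re.MULTILINE)

def replace_line_elements(text, values):
    """Replace whole-line {name} elements using the `values` mapping.

    Hypothetical helper: `values` maps element names to replacement text.
    """
    def substitute(match):
        name = match.group(1)
        # Leave elements without a known replacement untouched.
        return values.get(name, match.group(0))
    return ELEMENT_LINE.sub(substitute, text)

sample = "{foo}\n$${bar{baz}}$$\n{quux}\nick\n"
result = replace_line_elements(sample, {"foo": "FOO", "quux": "QUUX"})
print(result)
```

Note that because of the line anchors, an element that shares a line with other text, such as {abc} abc $${abc}$$, is not replaced by this approach.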

CodePudding user response:

With the PyPI regex library, you can use a SKIP-FAIL, recursion-based regex like

\$\$({(?:[^{}]++|(?1))*})\$\$(*SKIP)(*F)|{([^{}]*)}

Details:

  • \$\$({(?:[^{}]++|(?1))*})\$\$(*SKIP)(*F):
    • \$\$ - a $$ string
    • ({(?:[^{}]++|(?1))*}) - Group 1: a {, then zero or more occurrences of either one or more chars other than { and } (matched possessively) or the Group 1 pattern recursed, and then a }
    • \$\$ - a $$ string
    • (*SKIP)(*F) - fail the match here and skip past the text consumed so far, so the $$...$$ block is never returned and the next attempt starts after it
  • | - or
  • {([^{}]*)} - {, then Group 2 capturing any zero or more chars other than { and }, and then a }.

In Python, you can use

import regex
text = '{element 1}  {element 2} {element_abc} $${abc{abc}{abc}}$$ $${mathjax{}{mathjax}}$$ $${mathjax{}{}{}{}{{{mathjax}}}}$$'
pattern = regex.compile( r'\$\$({(?:[^{}]++|(?1))*})\$\$(*SKIP)(*F)|{([^{}]*)}' )
print( [match.group() for match in pattern.finditer(text)] )
# => ['{element 1}', '{element 2}', '{element_abc}']
print( [match.group(2) for match in pattern.finditer(text)] )
# => ['element 1', 'element 2', 'element_abc']
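If the goal is replacement and only the standard re module is available, the same "consume the MathJax block first, then look for elements" idea can be approximated with a plain alternation and a callback. This is a different technique from the SKIP-FAIL pattern above, not a port of it: it assumes every equation is delimited by a pair of $$ and never contains $$ inside, so no recursion is needed. The helper name replace_elements and the values mapping are hypothetical.

```python
import re

# First alternative: a whole $$...$$ block (assumed to contain no inner $$).
# Second alternative: a {name} element, with the name captured in group 1.
PATTERN = re.compile(r'\$\$.*?\$\$|\{([^{}]*)\}', re.DOTALL)

def replace_elements(text, values):
    """Replace {name} elements outside $$...$$ blocks (hypothetical helper)."""
    def substitute(match):
        name = match.group(1)
        if name is None:
            # The $$...$$ alternative matched: return the block unchanged.
            return match.group(0)
        # Leave elements without a known replacement untouched.
        return values.get(name, match.group(0))
    return PATTERN.sub(substitute, text)

text = '{element 1} abc $${abc{abc}{abc}}$$ {element_abc} $${mathjax{}{}{{{mathjax}}}}$$'
values = {'element 1': 'ONE', 'element_abc': 'ABC', 'abc': 'WRONG'}
result = replace_elements(text, values)
print(result)
```

Because the $$...$$ alternative is tried first at each position, braces inside an equation are consumed as part of the block and never reach the element alternative, which is why the 'abc' entry in values is never applied here.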

