How to match nested LaTeX macros with re in Python?

Time:04-07

I want to match LaTeX macros correctly, including nested ones. Consider the following:

s = r'''
firstline
\lr{secondline\rl{ right-to-left
        \lr{nested left-to-right} end RTL }
        other text
}
\rl{ last \lr{end line 
} end RTL }
'''

For instance, in the above, I want to match each \lr macro together with its content. I tried the following, but neither attempt worked correctly:

re.findall(r'(?:\\lr\{.*\})', s, re.DOTALL)
['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}\n\\rl{ last \\lr{end line \n} end RTL }']

Even the non-greedy version did not work in this case:

re.findall(r'(?:\\lr\{.*?\})', s, re.DOTALL)
['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right}',
 '\\lr{end line \n}']

I need a regular expression that matches this correctly. The problem is similar to matching nested parentheses, except that here the nested delimiters are the curly braces of LaTeX macros.

edit:

I'd like to get the following matches:

['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}', 
'\\lr{nested left-to-right}',
'\\lr{end line \n}']

It would be perfect if I also knew the nesting level of each match, something like the following:

[('\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}', 1),
 ('\\lr{nested left-to-right}', 2),
 ('\\lr{end line \n}', 1)]

CodePudding user response:

With the PyPI regex module (install it with pip install regex), you can use

import regex

s = r'''
firstline
\lr{secondline\rl{ right-to-left
        \lr{nested left-to-right} end RTL }
        other text
}
\rl{ last \lr{end line 
} end RTL }
'''

print( [x.group() for x in regex.finditer(r'\\lr(\{(?:[^{}]++|(?1))*})', s, overlapped=True)] )
# => ['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}', '\\lr{nested left-to-right}', '\\lr{end line \n}']


Note also the overlapped=True option passed to regex.finditer: it lets matches overlap, which is what allows the nested occurrences to be found.

Details:

  • \\lr - the literal string \lr
  • (\{(?:[^{}]++|(?1))*}) - Group 1 (defined so that it can be referred to while recursing):
    • \{ - a { char
    • (?:[^{}]++|(?1))* - zero or more repetitions of
      • [^{}]++ - one or more chars other than { and }, matched possessively (i.e. the engine cannot give any of these chars back and re-match the text if backtracking is triggered)
      • | - or
      • (?1) - the Group 1 pattern, recursed
    • } - a } char.
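
The regex above returns the matches but not the nesting level the question also asks for. As a rough sketch only, a plain stack-based scan with the standard library can report both. The function name find_macro_with_level and the compact sample string are made up for illustration and are not part of the regex module or of the answer above. The sketch assumes that braces are balanced and that escaped braces such as \{ do not occur in the text. The level counts only enclosing macros of the same name, which matches the levels shown in the question.

```python
def find_macro_with_level(text, name="lr"):
    """Return (match, level) pairs for every \\name{...} in text, in order of appearance.

    Hypothetical helper: level 1 means the macro is not inside another
    \\name{...}; braces opened by other macros do not add to the level.
    """
    opener = "\\" + name + "{"
    stack = []   # one entry per open brace: (start, level) for \name{, else None
    found = []   # (start, matched text, level)
    i = 0
    while i < len(text):
        if text.startswith(opener, i):
            # Count the enclosing \name{ groups that are still open.
            level = sum(1 for entry in stack if entry is not None) + 1
            stack.append((i, level))
            i += len(opener)
        elif text[i] == "{":
            stack.append(None)          # brace opened by something else
            i += 1
        elif text[i] == "}":
            if stack:                   # ignore a stray closing brace
                entry = stack.pop()
                if entry is not None:
                    start, level = entry
                    found.append((start, text[start:i + 1], level))
            i += 1
        else:
            i += 1
    found.sort()                        # inner macros close first; restore text order
    return [(match, level) for _, match, level in found]


# Compact stand-in for the sample text in the question.
sample = r"a \lr{b \rl{c \lr{d} e} f} \rl{g \lr{h} i}"
for match, level in find_macro_with_level(sample):
    print(level, match)
```

Calling find_macro_with_level(s) on the string from the question should give the three \lr matches in the same order as the regex solution, each paired with its level. A macro that is never closed is simply not reported.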