I want to match LaTeX macros correctly, including nested ones. Consider the following:
s = r'''
firstline
\lr{secondline\rl{ right-to-left
\lr{nested left-to-right} end RTL }
other text
}
\rl{ last \lr{end line
} end RTL }
'''
For instance, in the above, I want to match the \lr macro together with its content. I have tried the following, but neither attempt worked correctly:
re.findall(r'(?:\\lr\{.*\})', s, re.DOTALL)
['\\lr{secondline\\rl{ right-to-left\n \\lr{nested left-to-right} end RTL }\n other text\n}\n\\rl{ last \\lr{end line \n} end RTL }']
Even the non-greedy version did not work in this case:
re.findall(r'(?:\\lr\{.*?\})', s, re.DOTALL)
['\\lr{secondline\\rl{ right-to-left\n \\lr{nested left-to-right}',
'\\lr{end line \n}']
I need a regular expression that matches this correctly. The problem is similar to matching nested parentheses, except that here I have nested curly brackets from LaTeX macros.
edit:
I'd like to get the following matches:
['\\lr{secondline\\rl{ right-to-left\n \\lr{nested left-to-right} end RTL }\n other text\n}',
'\\lr{nested left-to-right}',
'\\lr{end line \n}']
It would be perfect if I knew about the level of nesting, something like the below:
[('\\lr{secondline\\rl{ right-to-left\n \\lr{nested left-to-right} end RTL }\n other text\n}', 1),
('\\lr{nested left-to-right}', 2),
('\\lr{end line \n}', 1)]
CodePudding user response:
With the PyPI regex module (install it with pip install regex), you can use
import regex
s = r'''
firstline
\lr{secondline\rl{ right-to-left
\lr{nested left-to-right} end RTL }
other text
}
\rl{ last \lr{end line
} end RTL }
'''
print( [x.group() for x in regex.finditer(r'\\lr(\{(?:[^{}]++|(?1))*})', s, overlapped=True)] )
# => ['\\lr{secondline\\rl{ right-to-left\n \\lr{nested left-to-right} end RTL }\n other text\n}', '\\lr{nested left-to-right}', '\\lr{end line \n}']
Note also the overlapped=True option passed to regex.finditer, which allows matching nested occurrences.
Details:
\\lr - the \lr string
(\{(?:[^{}]++|(?1))*}) - Group 1 (defined so it can be referred to while recursing):
    \{ - a { char
    (?:[^{}]++|(?1))* - zero or more repetitions of:
        [^{}]++ - one or more chars other than { and }, without the possibility to re-match the text in case backtracking is triggered (i.e. it is matched possessively)
        | - or
        (?1) - the Group 1 pattern, recursed
    } - a } char.
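The question also asks for the nesting level of each match, which the pattern does not report. Below is a minimal sketch of one way to get it using only the standard library: a plain brace-counting scan instead of a recursive regex. The function name find_macros and the short sample string are hypothetical, chosen for illustration. It assumes braces are balanced and never escaped (no \{ or \}), and it counts the level as the number of enclosing macros with the same name plus one, which is how the levels in the question appear to be defined.

```python
def find_macros(text, name="lr"):
    """Return (match_text, level) for every \\name{...} with balanced braces."""
    opener = "\\" + name + "{"
    spans = []
    pos = text.find(opener)
    while pos != -1:
        depth = 0
        end = None
        # Start scanning at the opening brace of this macro.
        for i in range(pos + len(opener) - 1, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1  # position just past the matching brace
                    break
        if end is not None:  # skip macros whose brace is never closed
            spans.append((pos, end))
        # Search again from the next char so nested macros are found too.
        pos = text.find(opener, pos + 1)

    results = []
    for start, end in spans:
        # Level = 1 + number of other \name spans that enclose this one.
        level = 1 + sum(1 for a, b in spans if a < start and end <= b)
        results.append((text[start:end], level))
    return results


sample = r"\lr{a \rl{b \lr{c} d} e} \rl{f \lr{g} h}"
for match, level in find_macros(sample):
    print(level, match)
# 1 \lr{a \rl{b \lr{c} d} e}
# 2 \lr{c}
# 1 \lr{g}
```

The same level computation could probably be applied to the spans returned by regex.finditer (via m.span()) if you prefer to keep the recursive pattern for the matching step.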