The regex [a-zA-Z-]+:: matches strings like one-name::.
I would like a regex that will do the opposite. Is there an "automated" way to build such a regex?
I can't simply test whether the first regex fails to match, because I need an actual "negated" regex to use in a lark grammar.
Context of use
Indeed, my question is related to the code below, which doesn't work because the line rule must also reject anything that matches blockname.
TEXT = """
// Comment #1
abrev::
    ok = ok
=========
One title
=========
Bla, bla
Bli, Bli
verbatim::
    OK, or not ok ?
    That is...
    ...the question.
Blu, Blu
==
10
==
// Comment #2
Blo, blo
==
05
==
"""
from lark import Lark
GRAMMAR = r"""
?start: _NL* (heading | comments | block)*
heading : ruler _NL title _NL ruler _NL (block | comments | paragraph)*
ruler : /={2,}/
title : /[^\n={2}\/{2}]+/
comments : "//" cline _NL*
paragraph : (line _NL)+
block : blockname _SINGLE_NL (tline _NL)+
blockname : /[a-zA-Z-]+::/
tline : " " /[^\n]+/
cline : /[^\n]+/
line : /[^\n={2}\/{2}]+/
_NL : /(\r?\n[\t ]*)+/
_SINGLE_NL : /([\t ]*\r?\n)/
"""
parser = Lark(GRAMMAR)
tree = parser.parse(TEXT)
print(tree.pretty())
I have the following bad output.
start
  comments
    cline Comment #1
  block
    blockname abrev::
    tline ok = ok
  heading
    ruler =========
    title One title
    ruler =========
    paragraph
      line Bla, bla
      line Bli, Bli
      line verbatim:: <<< BAD
      line OK, or not ok ? <<< BAD
      line That is... <<< BAD
      line ...the question. <<< BAD
      line Blu, Blu <<< BAD
  heading
    ruler ==
    title 10
    ruler ==
    comments
      cline Comment #2
    paragraph
      line Blo, blo
  heading
    ruler ==
    title 05
    ruler ==
I would like to obtain something like:
start
  comments
    cline Comment #1
  block
    blockname abrev::
    tline ok = ok
  heading
    ruler =========
    title One title
    ruler =========
    paragraph
      line Bla, bla
      line Bli, Bli
    block <<< GOOD
      blockname verbatim:: <<< GOOD
      tline OK, or not ok ? <<< GOOD
      tline That is... <<< GOOD
      tline ...the question. <<< GOOD
    paragraph <<< GOOD
      line Blu, Blu <<< GOOD
  heading
    ruler ==
    title 10
    ruler ==
    comments
      cline Comment #2
    paragraph
      line Blo, blo
  heading
    ruler ==
    title 05
    ruler ==
Answer:
This is probably not the best solution to this parsing problem, but it's relatively easy. Since the regexes here are Python regexes, you are free to use negative lookahead assertions, so you can put (?![a-zA-Z-]+::) at the beginning of your line pattern to avoid having that pattern match a blockname.
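As a rough illustration of that idea outside of lark, here is a minimal sketch using only Python's re module. The names BLOCKNAME, NOT_BLOCKNAME_LINE and is_plain_line are made up for this example, and the [^\n]+ body is a simplified stand-in for the real line pattern.

```python
import re

# The blockname pattern from the question.
BLOCKNAME = r"[a-zA-Z-]+::"

# Hypothetical "line" pattern: the negative lookahead refuses to start
# a match at any position where a blockname begins.
NOT_BLOCKNAME_LINE = re.compile(rf"(?!{BLOCKNAME})[^\n]+")


def is_plain_line(s):
    """Return True if s is a non-empty line that does not start with a blockname."""
    return NOT_BLOCKNAME_LINE.fullmatch(s) is not None


print(is_plain_line("Bla, bla"))    # an ordinary line
print(is_plain_line("verbatim::"))  # a blockname, so it is rejected
```

Note that the lookahead is only checked at the start of the match, so it rejects every line that begins with a blockname, not just lines that consist of nothing else.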
But the rest of that pattern is not doing what you think it is doing. A character class (inside [ and ]) is just a set of characters; it cannot contain subpatterns like ={2}. All that means inside a character class is "one of the characters =, {, 2 or }". Again, negative lookahead assertions are probably what you want. I changed line to this:
line : /(?![a-zA-Z-]+::)((?!==|\/\/).)+/
and it more or less worked, except that your _NL pattern absorbs whitespace after the matched \n, which means that it will absorb the indent at the beginning of the second line in the block, so that line won't be matched as a tline. You really need to rethink your whitespace matching strategy, but that's a different question.
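Both points can be checked with plain Python regexes, independently of lark. The sketch below is only an illustration: the constant names are invented for the example, and the sample strings are taken from or modelled on the TEXT in the question.

```python
import re

# A character class is a set of single characters, so "={2}" inside it
# just adds the characters "=", "{", "2" and "}" to the excluded set.
CHAR_CLASS_LINE = re.compile(r"[^\n={2}\/{2}]+")

# The lookahead version: consume one character at a time, as long as
# neither "==" nor "//" starts at the current position.
LOOKAHEAD_LINE = re.compile(r"(?![a-zA-Z-]+::)((?!==|\/\/).)+")

# The _NL terminal: newlines plus any indentation that follows them.
NL = re.compile(r"(\r?\n[\t ]*)+")

# The character class stops at a single "=", the lookahead version does not.
print(repr(CHAR_CLASS_LINE.match("a = 2").group()))  # 'a '
print(repr(LOOKAHEAD_LINE.match("a = 2").group()))   # 'a = 2'
print(repr(LOOKAHEAD_LINE.match("x == y").group()))  # 'x '

# _NL swallows the indent of the next line, so nothing is left for the
# leading-space literal that tline expects.
block = "    OK, or not ok ?\n    That is..."
m = NL.search(block)
rest = block[m.end():]
print(repr(m.group()))  # '\n    '
print(repr(rest))       # 'That is...'
```

One possible way out, offered only as a suggestion, is to stop _NL from consuming trailing spaces and tabs so that the indentation stays available to tline.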