How do you make an "opposite" pattern? - Needed to define a grammar with lark

The regex [a-zA-Z-]+:: matches strings like one-name::.

I would like a regex that does the opposite, i.e. one that matches only what the first regex rejects. Is there an "automated" way to build such a regex?

I can't just check the failure of the first regex, as I have to use a "negated" regex with lark.

Context of use

My question comes from the code below, which doesn't work because the line rule must also reject anything that matches blockname.

TEXT = """

// Comment #1

abrev::
    ok = ok

=========
One title
=========

Bla, bla
Bli, Bli

verbatim::
    OK, or not ok ?
    That is...

    ...the question.

Blu, Blu


==
10
==

// Comment #2
Blo, blo




==
05
==
    """

from lark import Lark

GRAMMAR = r"""
?start: _NL* (heading | comments | block)*

heading : ruler _NL title _NL ruler _NL+ (block | comments | paragraph)*
ruler   : /={2,}/
title   : /[^\n={2}\/{2}]+/

comments : "//" cline _NL*

paragraph : (line _NL)+

block : blockname _SINGLE_NL (tline _NL)+
blockname : /[a-zA-Z-]+::/

tline : "    " /[^\n]+/
cline : /[^\n]+/
line  : /[^\n={2}\/{2}]+/

_NL        : /(\r?\n[\t ]*)+/
_SINGLE_NL : /([\t ]*\r?\n)/
"""

parser = Lark(GRAMMAR)

tree = parser.parse(TEXT)

print(tree.pretty())

I get the following incorrect output.

start
  comments
    cline        Comment #1
  block
    blockname   abrev::
    tline       ok = ok
  heading
    ruler       =========
    title       One title
    ruler       =========
    paragraph
      line      Bla, bla
      line      Bli, Bli
      line      verbatim::             <<< BAD
      line      OK, or not ok ?        <<< BAD
      line      That is...             <<< BAD
      line      ...the question.       <<< BAD
      line      Blu, Blu               <<< BAD
  heading
    ruler       ==
    title       10
    ruler       ==
    comments
      cline      Comment #2
    paragraph
      line      Blo, blo
  heading
    ruler       ==
    title       05
    ruler       ==

I would like to obtain something like:

start
  comments
    cline        Comment #1
  block
    blockname   abrev::
    tline       ok = ok
  heading
    ruler       =========
    title       One title
    ruler       =========
    paragraph
      line      Bla, bla
      line      Bli, Bli
    block                              <<< GOOD 
      blockname verbatim::             <<< GOOD
      tline     OK, or not ok ?        <<< GOOD
      tline     That is...             <<< GOOD
      tline     ...the question.       <<< GOOD
    paragraph                          <<< GOOD
      line      Blu, Blu               <<< GOOD
  heading
    ruler       ==
    title       10
    ruler       ==
    comments
      cline      Comment #2
    paragraph
      line      Blo, blo
  heading
    ruler       ==
    title       05
    ruler       ==

CodePudding user response：

This is probably not the best solution to this parsing problem, but it's relatively easy. Since the regexes here are Python regexes, you are free to use negative lookahead assertions, so you can put (?![a-zA-Z-]+::) at the beginning of your line pattern to prevent it from matching a blockname.
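As a minimal sketch of that idea with Python's standard re module (the variable names and sample strings are made up for illustration), a negative lookahead at the start of a pattern makes it refuse any text that begins with a blockname:

```python
import re

# The pattern from the question that recognises a block name such as "one-name::".
BLOCKNAME = r"[a-zA-Z-]+::"

# "Opposite" pattern: any non-empty run of non-newline characters,
# provided the text at this position does not start with a block name.
NOT_BLOCKNAME = re.compile(rf"(?!{BLOCKNAME})[^\n]+")

samples = ["verbatim::", "one-name::", "Bla, bla", "ok = ok"]
results = {s: bool(NOT_BLOCKNAME.fullmatch(s)) for s in samples}

for sample, accepted in results.items():
    print(f"{sample!r:14} accepted: {accepted}")
```

Note that the lookahead only checks the start of the text, so a line such as "name:: trailing words" would be rejected as well. Whether that is acceptable depends on the format being parsed.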

But the rest of that pattern is not doing what you think it is doing; a character class (inside [ and ]) is just a set of characters; it cannot contain subpatterns like ={2}. Inside a character class, ={2} just means "one of the characters =, {, 2 or }". Again, negative lookahead assertions are probably what you want. I changed line to this:

line  : /(?![a-zA-Z-]+::)((?!==|\/\/).)+/

and it more or less worked, except that your _NL pattern absorbs whitespace after the matched \n, which means that it will absorb the indent at the beginning of the second line in the block, so that line won't be matched as a tline. You really need to rethink your whitespace matching strategy, but that's a different question.
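The points above can be checked outside lark with plain re. This is only a sketch: the sample strings are invented, and it exercises the regexes on their own rather than the full grammar.

```python
import re

# Inside [...] the text "={2}" is not a quantified subpattern: it only adds
# the literal characters '=', '{', '2' and '}' to the excluded set.
char_class = re.compile(r"[^\n={2}\/{2}]+")
class_match = char_class.match("2 apples")  # fails: '2' is an excluded character

# The revised pattern: refuse a leading block name, then consume characters
# one at a time as long as neither "==" nor "//" starts at that position.
LINE = re.compile(r"(?![a-zA-Z-]+::)((?!==|\/\/).)+")

line_results = {
    text: LINE.match(text).group(0)
    for text in ["Bla, bla", "2 apples", "a = b // note"]
}
blockname_rejected = LINE.match("verbatim::") is None

# The question's _NL terminal also swallows the indentation that follows a
# newline, leaving nothing for the four-space prefix of the next tline.
NL = re.compile(r"(\r?\n[\t ]*)+")
consumed = NL.match("\n    That is...").group(0)

print("character class matches '2 apples':", class_match is not None)
print("line matches:", line_results)
print("blockname rejected by line:", blockname_rejected)
print("_NL consumed:", repr(consumed))
```

The last check shows why the indent is gone by the time the parser looks for the next tline: the newline and the four spaces after it are consumed together as a single _NL match.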