developing parser with pegen: no output

Time:02-27

I want to write a parser for a pre-existing data storage file type. There is a formal grammar, and I am able to follow the syntax guidelines for pegen to create a grammar file and have that compile and produce a parser.

My problem is that the parser doesn't produce any output. I think this is because I don't know how to set the correct return types in the grammar file. The examples in the GitHub data folder aren't that helpful.

How do I create the correct return types?

My grammar file:

# Basic CIF structure
start: Comments? WhiteSpace? ( DataBlock ( WhiteSpace DataBlock )* ( WhiteSpace )? )?
DataBlock: DataBlockHeading ( WhiteSpace ( DataItems | SaveFrame ) )*
DataBlockHeading: DATA_ ( NonBlankChar )+
SaveFrame: SaveFrameHeading ( WhiteSpace DataItems )+ WhiteSpace SAVE_
SaveFrameHeading: SAVE_ ( NonBlankChar )+
DataItems: Tag WhiteSpace Value | LoopHeader LoopBody
LoopHeader: LOOP_ ( WhiteSpace Tag )+
LoopBody: Value ( WhiteSpace Value )*

# Reserved words
DATA_: ('D' | 'd') ('A' | 'a') ('T' | 't') ('A' | 'a') '_'
LOOP_: ('L' | 'l') ('O' | 'o') ('O' | 'o') ('P' | 'p') '_'
GLOBAL_: ('G' | 'g') ('L' | 'l') ('O' | 'o') ('B' | 'b') ('A' | 'a') ('L' | 'l') '_'
SAVE_: ('S' | 's') ('A' | 'a') ('V' | 'v') ('E' | 'e') '_'
STOP_:  ('S' | 's') ('T' | 't') ('O' | 'o') ('P' | 'p')'_'

# Tags and values
Tag: '_' ( NonBlankChar )+
Value: ( '.' | '?' | Numeric | CharString | TextField )

# Numeric values
Numeric: ( Number | Number '(' UnsignedInteger ')' )
Number: Integer | Float
Integer: ( '+' | '-' )? UnsignedInteger
Float: ( Integer Exponent | ( ( '+' | '-' )? ( ( Digit )* '.' UnsignedInteger ) | ( ( Digit )+ '.' ) ) ( Exponent )? )
Exponent: ( ('e' | 'E' ) | ( 'e' | 'E' ) ( '+' | '- ' ) ) UnsignedInteger
UnsignedInteger: ( Digit )+
Digit: ( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' )

# Strings and text fields
CharString: UnquotedString | SingleQuotedString | DoubleQuotedString
UnquotedString: EOL_UnquotedString | NOTEOL_UnquotedString
EOL_UnquotedString: EOL OrdinaryChar ( NonBlankChar )*
NOTEOL_UnquotedString: NOTEOL ( OrdinaryChar | ';' ) ( NonBlankChar )*
SingleQuotedString: single_quote ( AnyPrintChar )* single_quote WhiteSpace
DoubleQuotedString: double_quote ( AnyPrintChar )* double_quote WhiteSpace
TextField: ( SemiColonTextField )
SemiColonTextField: EOL ';' ( ( AnyPrintChar )* EOL ( ( TextLeadChar ( AnyPrintChar )* )? EOL )* ) ';'

# Whitespace and comments
WhiteSpace: ( SP | HT | EOL | TokenizedComments )+
Comments: ( '#' ( AnyPrintChar )* EOL )+
TokenizedComments: ( SP | HT | EOL )+ Comments

# Character sets
OrdinaryChar: ( '!' | '%' | '&' | '(' | ')' | '*' | '+' | ',' | '-' | '.' | '/' | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | '<' | '=' | '>' | '?' | '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' | '\\' | '^' | '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z' | '{' | '|' | '}' | '~' )
NonBlankChar: ( OrdinaryChar | double_quote | '#' | '$' | single_quote | '_' | ';' | '[' | ']' )
TextLeadChar: ( OrdinaryChar | double_quote | '#' | '$' | single_quote | '_' | SP | HT | '[' | ']' )
AnyPrintChar: ( OrdinaryChar | double_quote | '#' | '$' | single_quote | '_' | SP | HT | ';' | '[' | ']' )


# Special things
EOL: NEWLINE #( '\n' | '\n\r' )
NOTEOL: !EOL
SP: ' '
HT: '\t'
double_quote: '"'
single_quote: '\''

My test file (header_only.cif) to parse:

data_header

How I generated the parser:

python -m pegen cif.gram -o parser.py

How I used my parser:

python parser.py -vv header_only.cif

My output:

start() ... (looking at 1.0: NAME:'data_header')
  Comments() ... (looking at 1.0: NAME:'data_header')
    _loop1_42() ... (looking at 1.0: NAME:'data_header')
      _tmp_58() ... (looking at 1.0: NAME:'data_header')
        expect('#') ... (looking at 1.0: NAME:'data_header')
        ... expect('#') -> None
      ... _tmp_58() -> None
    ... _loop1_42() -> []
  ... Comments() -> None
  WhiteSpace() ... (looking at 1.0: NAME:'data_header')
    _loop1_41() ... (looking at 1.0: NAME:'data_header')
      _tmp_57() ... (looking at 1.0: NAME:'data_header')
        SP() ... (looking at 1.0: NAME:'data_header')
          expect(' ') ... (looking at 1.0: NAME:'data_header')
          ... expect(' ') -> None
        ... SP() -> None
        HT() ... (looking at 1.0: NAME:'data_header')
          expect('\t') ... (looking at 1.0: NAME:'data_header')
          ... expect('\t') -> None
        ... HT() -> None
        EOL() ... (looking at 1.0: NAME:'data_header')
          expect('NEWLINE') ... (looking at 1.0: NAME:'data_header')
          ... expect('NEWLINE') -> None
        ... EOL() -> None
        TokenizedComments() ... (looking at 1.0: NAME:'data_header')
          _loop1_43() ... (looking at 1.0: NAME:'data_header')
            _tmp_59() ... (looking at 1.0: NAME:'data_header')
              SP() -> None
              HT() -> None
              EOL() -> None
            ... _tmp_59() -> None
          ... _loop1_43() -> []
        ... TokenizedComments() -> None
      ... _tmp_57() -> None
    ... _loop1_41() -> []
  ... WhiteSpace() -> None
  _tmp_1() ... (looking at 1.0: NAME:'data_header')
    DataBlock() ... (looking at 1.0: NAME:'data_header')
      DataBlockHeading() ... (looking at 1.0: NAME:'data_header')
        DATA_() ... (looking at 1.0: NAME:'data_header')
          _tmp_8() ... (looking at 1.0: NAME:'data_header')
            expect('D') ... (looking at 1.0: NAME:'data_header')
            ... expect('D') -> None
            expect('d') ... (looking at 1.0: NAME:'data_header')
            ... expect('d') -> None
          ... _tmp_8() -> None
        ... DATA_() -> None
      ... DataBlockHeading() -> None
    ... DataBlock() -> None
  ... _tmp_1() -> None
... start() -> [None, None, None]
[None, None, None]
Total time: 0.031 sec; 1 lines (13 bytes); 32 lines/sec
Caches sizes:
  token array :          1
        cache :         24

Answer:

Pegen generates parsers for "Python-like" languages. As far as I can tell, it is not intended to be a general-purpose parser generator.

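In particular, it assumes that the lexical structure of the language being parsed is sufficiently similar to Python's that the same tokeniser can be used. That doesn't appear to be the case for the language you want to parse. Your language has no equivalent to the NAME token which the Python tokeniser automatically generates when it sees the input data_header, and that is why the parse fails. Your trace shows this: every rule is looking at the single token NAME:'data_header', so expect('D') and expect('d') both return None. The DATA_ rule asks for a sequence of one-character tokens which the tokeniser never produces.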
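To see what the parser is actually given, you can run the standard-library tokenize module over the same input. This is only a diagnostic sketch; it assumes the generated parser is fed by Python's own tokeniser, which is what the NAME:'data_header' entries in your trace suggest.

```python
import io
import tokenize

# The contents of header_only.cif from the question.
source = "data_header\n"

# Collect (token type name, token text) pairs from Python's tokeniser.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

for name, text in tokens:
    print(f"{name:10} {text!r}")
```

The first token is a single NAME covering all of data_header. There is never a separate 'd' token for the grammar's character-level rules to match.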

Pegen does allow you to define keywords, which are particular instances of NAME, but as far as I know it has no way to specify a case-independent keyword. Nor does it have a mechanism for recognising the class of NAMEs starting with a prefix (like "data_"). These are both tasks which could easily be accomplished with regular expressions.
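As a rough illustration of the regular-expression approach, the following sketch recognises data_ and save_ headings without regard to case. The names HEADING and match_heading are made up for this example, and \S is used as an approximation of the grammar's NonBlankChar set.

```python
import re

# Reserved prefix, then one or more non-blank characters.
# re.IGNORECASE replaces the ('D' | 'd') ('A' | 'a') ... alternations.
HEADING = re.compile(r"(?P<kind>data|save)_(?P<name>\S+)", re.IGNORECASE)


def match_heading(text):
    """Return (kind, name) for a data_/save_ heading, or None."""
    m = HEADING.fullmatch(text)
    if m is None:
        return None
    return m.group("kind").lower(), m.group("name")


print(match_heading("data_header"))
print(match_heading("DATA_Header"))
print(match_heading("_cell_length_a"))
```

One pattern covers both the case-independent keyword and the "name with a prefix" class, which are the two things pegen's keyword mechanism does not give you.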

Python has a large range of parser generators, and the vast majority allow custom tokenisers based on regular expressions, which is a lot more convenient than including huge lists of single characters. You might find that one of these better suits your purpose. As far as I can see, your language could be parsed with a simple top-down predictive parser (LL(1) or "recursive descent"), so almost any general-purpose parser generator should work, even a PEG generator.
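To give an idea of how little machinery is needed, here is a hand-written sketch of a regex tokeniser plus a recursive-descent parser for a subset of the grammar. It is not a complete CIF parser: save frames and semicolon text fields are left out, quoted strings are simplified, and the names lex, parse and parse_data_items are invented for this example.

```python
import re

# Token classes, tried in order. A simplification of the CIF lexical rules.
TOKEN_RE = re.compile(
    r"""
    (?P<COMMENT>\#[^\n]*)
  | (?P<WS>\s+)
  | (?P<DATA>data_\S+)
  | (?P<LOOP>loop_)(?=\s|$)
  | (?P<TAG>_\S+)
  | (?P<QUOTED>'[^'\n]*'|"[^"\n]*")
  | (?P<VALUE>\S+)
    """,
    re.IGNORECASE | re.VERBOSE,
)


def lex(text):
    """Return a list of (kind, text) pairs, dropping whitespace and comments."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind in ("COMMENT", "WS"):
            continue
        if kind == "QUOTED":
            tokens.append(("VALUE", m.group()[1:-1]))  # strip the quotes
        else:
            tokens.append((kind, m.group()))
    return tokens


def parse_data_items(toks, i, items):
    """DataItems: Tag Value | LoopHeader LoopBody. Returns the next index."""
    kind, val = toks[i]
    if kind == "TAG":
        if i + 1 >= len(toks) or toks[i + 1][0] != "VALUE":
            raise SyntaxError(f"missing value for {val}")
        items[val] = toks[i + 1][1]
        return i + 2
    if kind == "LOOP":
        i += 1
        tags, values = [], []
        while i < len(toks) and toks[i][0] == "TAG":
            tags.append(toks[i][1])
            i += 1
        while i < len(toks) and toks[i][0] == "VALUE":
            values.append(toks[i][1])
            i += 1
        if not tags or not values or len(values) % len(tags):
            raise SyntaxError("malformed loop_")
        for n, tag in enumerate(tags):
            items[tag] = values[n::len(tags)]  # one column per tag
        return i
    raise SyntaxError(f"unexpected token {val!r}")


def parse(text):
    """Return {block name: {tag: value, or list of values for loops}}."""
    toks = lex(text)
    blocks, i = {}, 0
    while i < len(toks):
        kind, val = toks[i]
        if kind != "DATA":
            raise SyntaxError(f"expected a data_ heading, got {val!r}")
        items = blocks.setdefault(val[5:], {})  # drop the "data_" prefix
        i += 1
        while i < len(toks) and toks[i][0] != "DATA":
            i = parse_data_items(toks, i, items)
    return blocks


print(parse("data_header\n"))  # prints {'header': {}}
```

Each parsing function decides what to do from the kind of the next token alone, which is what makes the language suitable for a predictive parser.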
