The Goal

The goal is interpret plain text content and recognise patterns e.g. Arithmetic, Comments, Units of Measurements.

Example Input

This would be entered by a user.

# This is an example comment
10   10
// Another comment
This is one line of text
Tested

Expected Parse Tree

The goal of my grammar is to generate a tree that would look like this (if anyone has a better method I'd be interested to hear).

Note: The 10 10 is being recognised as an arithmetic rule.

Current Parse Tree aka The Problem

Below is the current output from the lexer and parser.

Note: The 10 10 is being recognised as an text and not the arithmetic rule.

Grammar Definition

The logic of the grammar at a high levels is as follows:

Parse line by line
Determine the line content if not fall back to text

grammar ContentParser;

/*
 * Tokens
 */
NUMBER: '-'? [0-9] ;
LPARAN: '(';
RPARAN: ')';
POW: '^';
MUL: '*';
DIV: '/';
ADD: ' ';
SUB: '-';

LINE_COMMENT: '#' TEXT | '//' TEXT;
TEXT: ~[\n\r] ;
EOL: '\r'? '\n';

/*
 * Rules
 */
start: file;
file: line  EOF;
line: content EOL;

content
    : comment 
    | arithmetic 
    | text
    ;

// Custom Content Types
comment: LINE_COMMENT;

/// Example taken from ANTLR Docs
arithmetic:
    NUMBER                                                          # Number
    | LPARAN inner = arithmetic RPARAN                              # Parentheses
    | left = arithmetic operator = POW right = arithmetic           # Power
    | left = arithmetic operator = (MUL | DIV) right = arithmetic   # MultiplicationOrDivision
    | left = arithmetic operator = (ADD | SUB) right = arithmetic   # AdditionOrSubtraction;

text: TEXT;

My Understanding

The content rule should check for a match of the comment rule then followed by the arithmetic rule and finally falling back to the text rule which matches any character apart from newlines.

However, I believe that the lexer is being greedy on the TEXT tokens which is causing issues but I'm not sure.

(I'm still learning ANTLR)

CodePudding user response：

When you are writing a parser, it's always a good idea to print out the tokens for the input.

In the current grammar, 10 10 is recognized by the lexer as TEXT, which is not what is needed. The reason it is text is because that is the longest string matched by a rule. It does not matter in this case that the TEXT rule occurs after the NUMBER rule in the grammar. The rule is that Antlr lexers will always match the longest string possible of the given lexer rules. But, if it can match two or more lexer rules where the strings are of equal length, then the first rule "wins". The lexer works pretty much independently of the parser.

There is no way to reliably have spaces in a text string, and not have them in arithmetic. The fix is to push spaces and tabs into an "off-channel" stream, then reconstruct the text by looking at the start and end character indices of the first and last tokens for the text tree node. The tree is a little messier, but it does what you need.

Also, you should just name the grammar as "Context" not "ContextParser" because you end up with "ContextParserParser.java" and "ContextParserLexer.java" when you generate the parser--rather confusing. I also took liberty to remove labeling an variables (I don't used them because I work with XPath expressions on the tree). And, I reordered and reformatted the grammar to be single line, sort alphabetically in order to find rules quicker in a text editor rather than require an IDE to navigate around.

A grammar that does all this is:

grammar Content;
arithmetic: NUMBER | LPARAN arithmetic RPARAN | arithmetic POW arithmetic | arithmetic (MUL | DIV) arithmetic | arithmetic (ADD | SUB) arithmetic ;
comment: LINE_COMMENT;
content : comment | arithmetic | text ;
file: line  EOF;
line: content? EOL;
start: file;
text: TEXT ;
ADD: ' ';
DIV: '/';
LINE_COMMENT: '#' STUFF | '//' STUFF;
LPARAN: '(';
MUL: '*';
NUMBER: '-'? [0-9] ;
POW: '^';
RPARAN: ')';
SUB: '-';
fragment STUFF : ~[\n\r]* ;
EOL: '\r'? '\n';
WS : [ \t]  -> channel(HIDDEN);
TEXT: .; // Must be last lexer rule, and only one char in length.