ANTLR4 - Sporadic Elements in Strict Ordering Grammar

I am using the ANTLR 4.10.1 Maven plugin to generate a Java parser. I have defined a grammar for a data format that follows a strict ordering of elements. However, this format also allows "freetext" elements that can be inserted anywhere between two complete elements. I have defined a lexer rule that covers these freetext elements in their entirety, and I handle their "unplanned" appearances with a custom error handler that allows parsing to continue in such cases.

I am trying to figure out whether there are "better" ways to handle such a scenario, other than surrounding every other element with optional freetext elements in the grammar.

Note that when a freetext element is present, it relates to the preceding element, so I would also like to be able to recover this relation.

Any ideas?

The following is an example of such a grammar.

grammar POC;

FIELD_SEPERATOR : '-';
FIELD_VALUE : ~[-\r\n]+ ;
EOL : '\r\n';
FREETEXT : 'FREETEXT' FIELD_SEPERATOR .*? '\n';


poc : data1 data2 data3? data4 EOF;
data1 : 'DATA1' FIELD_SEPERATOR FIELD_VALUE EOL;
data2 : 'DATA2' FIELD_SEPERATOR FIELD_VALUE EOL;
data3 : 'DATA3' FIELD_SEPERATOR FIELD_VALUE EOL;
data4 : 'DATA4' FIELD_SEPERATOR FIELD_VALUE EOL;

The following is an example of the data to test the grammar.

DATA1-this is a value
DATA2-this is a value
FREETEXT-This is a freetext
DATA3-this is a value
DATA4-this is a value
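
To make the target relation concrete, here is a minimal plain-Java sketch of what should be recovered from the sample data above: each freetext line attached to the element that precedes it. It does not use ANTLR at all; the class name FreetextRelationSketch and the method attachFreetext are hypothetical, and it assumes each line has the form NAME-value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: plain string handling, not the ANTLR-based solution.
class FreetextRelationSketch {

    private static final String FREETEXT_PREFIX = "FREETEXT-";

    /** Maps each element name to the freetext lines that directly follow it. */
    static Map<String, List<String>> attachFreetext(List<String> lines) {
        Map<String, List<String>> byElement = new LinkedHashMap<>();
        String previous = null; // name of the most recent regular element
        for (String line : lines) {
            if (line.startsWith(FREETEXT_PREFIX)) {
                // Freetext belongs to the preceding element, if there is one.
                if (previous != null) {
                    byElement.get(previous).add(line.substring(FREETEXT_PREFIX.length()));
                }
            } else {
                int separator = line.indexOf('-');
                previous = separator < 0 ? line : line.substring(0, separator);
                byElement.put(previous, new ArrayList<>());
            }
        }
        return byElement;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "DATA1-this is a value",
                "DATA2-this is a value",
                "FREETEXT-This is a freetext",
                "DATA3-this is a value",
                "DATA4-this is a value");
        // prints {DATA1=[], DATA2=[This is a freetext], DATA3=[], DATA4=[]}
        System.out.println(attachFreetext(lines));
    }
}
```

The point of the sketch is only the shape of the result: the strict ordering of DATA1 to DATA4 is untouched, and the freetext ends up associated with DATA2.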

EDITED TO ADD ANSWER

With a minor modification to the grammar above:

FREETEXT : 'FREETEXT' FIELD_SEPERATOR .*? '\n' -> channel(HIDDEN);

The resulting Java program:

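import java.io.IOException;

import org.antlr.v4.runtime.BufferedTokenStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class POC extends POCBaseListener {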

    private static BufferedTokenStream freeTextBuffer = null;

    public static void main(String[] args) throws IOException {

        var lexer = new POCLexer(CharStreams.fromStream(POC.class.getResourceAsStream("/poc.txt")));
        var lexer2 = new POCLexer(CharStreams.fromStream(POC.class.getResourceAsStream("/poc.txt")));

        var parser = new POCParser(new CommonTokenStream(lexer));

        freeTextBuffer = new BufferedTokenStream(lexer2);
        freeTextBuffer.fill();

        ParseTree tree = parser.poc();
        ParseTreeWalker walker = new ParseTreeWalker();
        walker.walk(new POC(), tree);
    }

    @Override
    public void exitEveryRule(ParserRuleContext ctx) {
        int tokenIndex = ctx.stop.getTokenIndex();
        var freetext = freeTextBuffer.getHiddenTokensToRight(tokenIndex);
        if (freetext != null) {
            System.out.println(String.format("Found freetext for: %s", ctx.getStart().getText()));
            var i = freetext.iterator();
            while (i.hasNext()) {
                var text = i.next().getText().replace("\n", "");
                System.out.println(String.format(" -> %s", text));
            }
        }
    }
}

Resulting console output:

Found freetext for: DATA2
 -> FREETEXT-This is a freetext

CodePudding user response:

You could add a -> skip action to your FREETEXT rule:

FREETEXT : 'FREETEXT' FIELD_SEPERATOR .*? '\n' -> skip;

(This is how whitespace is frequently handled in other grammars. With the skip action, the FREETEXT token will not appear in your token stream and will not need to be accounted for in your parser rules.)

Another option is to send it to another channel:

FREETEXT : 'FREETEXT' FIELD_SEPERATOR .*? '\n' -> channel(HIDDEN);

This has a similar effect of making the FREETEXT tokens "irrelevant" to the parser rules, but they remain part of the token stream, so you can still look them up there if you need them.

Since you mention that you want to recover its relationship to the previous element, the channel option would be the way to go. Then, in a listener or visitor, you can use getHiddenTokensToRight or similar methods on BufferedTokenStream to look at them.
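
To illustrate the idea behind looking to the right of a token for off-channel tokens, here is a small plain-Java model. It is not the ANTLR runtime: the Tok class, the channel constants, and the hiddenTokensToRight helper are hypothetical stand-ins, and each input line is simplified to a single token. Unlike the ANTLR method, which the program in the question checks for a null result, this stand-in returns an empty list when nothing is found.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in model only: mimics the lookup idea, not the ANTLR API.
class HiddenChannelSketch {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 1;

    /** Hypothetical minimal token: its text plus the channel it was sent to. */
    static final class Tok {
        final String text;
        final int channel;

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    /** The sample input, one token per line; FREETEXT sits on the hidden channel. */
    static List<Tok> sampleTokens() {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("DATA1-this is a value", DEFAULT_CHANNEL));
        tokens.add(new Tok("DATA2-this is a value", DEFAULT_CHANNEL));
        tokens.add(new Tok("FREETEXT-This is a freetext", HIDDEN_CHANNEL));
        tokens.add(new Tok("DATA3-this is a value", DEFAULT_CHANNEL));
        tokens.add(new Tok("DATA4-this is a value", DEFAULT_CHANNEL));
        return tokens;
    }

    /**
     * Collects the run of off-channel tokens immediately after tokenIndex,
     * stopping at the next default-channel token or the end of the list.
     */
    static List<Tok> hiddenTokensToRight(List<Tok> tokens, int tokenIndex) {
        List<Tok> hidden = new ArrayList<>();
        for (int i = tokenIndex + 1; i < tokens.size(); i++) {
            Tok candidate = tokens.get(i);
            if (candidate.channel == DEFAULT_CHANNEL) {
                break; // reached the next token the parser would see
            }
            hidden.add(candidate);
        }
        return hidden;
    }

    public static void main(String[] args) {
        List<Tok> tokens = sampleTokens();
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).channel != DEFAULT_CHANNEL) {
                continue; // the parser never sees hidden tokens
            }
            for (Tok hidden : hiddenTokensToRight(tokens, i)) {
                // prints: DATA2-this is a value <- FREETEXT-This is a freetext
                System.out.println(tokens.get(i).text + " <- " + hidden.text);
            }
        }
    }
}
```

In this model the hidden token stays in the list between DATA2 and DATA3, which is why a lookup from the last token of the preceding element finds it; with skip, the token would simply not be in the list.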