ANTLR4 lexer rule creates errors or conflicts on perl grammar

Time:09-17

I am having an issue with my Perl grammar. Here are the relevant parts:

element
    : element (ASTERISK_CHAR | SLASH_CHAR | PERCENT_CHAR) element
    | word
    ;

SLASH_CHAR:                 '/';

REGEX_STRING
    : '/' (~('/' | '\r' | '\n') | NEW_LINE)* '/'
    ;

fragment NEW_LINE
    : '\r'? '\n'
    ;

If the rule REGEX_STRING is not commented out, then the following Perl doesn't parse:

$b = 1/2;
$c = 1/2;
<2021/08/20-19:24:37> <ERROR> [parsing.AntlrErrorLogger] - Unit 1: <unknown>:2:6: extraneous input '/2;\r\n$c = 1/' expecting {<EOF>, '=', '**=', ' =', '-=', '.=', '*=', '/=', '%=', CROSS_EQUAL, '&=', '|=', '^=', '&.=', '|.=', '^.=', '<<=', '>>=', '&&=', '||=', '//=', '==', '>=', '<=', '<=>', '<>', '!=', '>', '<', '~~', '  ', '--', '**', '.', ' ', '-', '*', '/', '%', '=~', '!~', '&&', '||', '//', '&', '&.', '|', '|.', '^', '^.', '<<', '>>', '..', '...', '?', ';', X_KEYWORD, AND, CMP, EQ, FOR, FOREACH, GE, GT, IF, ISA, LE, LT, OR, NE, UNLESS, UNTIL, WHEN, WHILE, XOR, UNSIGNED_INTEGER}

Note that it doesn't matter where the lexer rule REGEX_STRING is used: even if it is not referenced anywhere in the parser rules, its mere presence makes parsing fail (so the issue is on the lexer side). If I remove the lexer rule REGEX_STRING, the code above parses just fine, but then I can't parse:

$dateCalc =~ /^([0-9]{4})([0-9]{2})([0-9]{2})/

Also, I noticed that the following Perl parses, so there seems to be some kind of interaction between the first and the second '/':

$b = 12;          # Removed the / between 1 and 2
$c = 1/2;         # Removing the / here would work as well.

I can't seem to find a way to write my regex lexer rule without making something fail. What am I missing? How can I parse both expressions?

CodePudding user response:

The basic issue here is that ANTLR4, like many other parsing frameworks, performs lexical analysis independently of the syntax: the same tokens are produced regardless of which tokens the parser could accept at that point. Since the lexer always prefers the longest match, REGEX_STRING, which can span line breaks, swallows everything from the first / to the next one; that is the '/2;\r\n$c = 1/' token in your error message. So it is the lexical analyser that must decide whether a given / is a division operator or the start of a regex, a decision which can really only be made using syntactic information. (There are parsing frameworks which do not have this limitation, and thus can be used to implement scannerless parsers. These include PEG-based parsers and GLR/GLL parsers.)
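To make the longest-match behaviour concrete, here is a minimal Java sketch of a context-free lexer. It is a toy model, not ANTLR's implementation: the class name MaximalMunchDemo, the rule table and the token names INT, VAR and OP are assumptions made for illustration. Only SLASH and REGEX mirror rules from the question, and the REGEX pattern is an approximation of REGEX_STRING.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of a context-free, longest-match lexer (hypothetical, not ANTLR).
class MaximalMunchDemo {
    // Rule order only matters for ties: the earlier rule wins.
    static final String[] NAMES = {"WS", "INT", "VAR", "SLASH", "REGEX", "OP"};
    static final Pattern[] RULES = {
        Pattern.compile("\\s+"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("\\$[A-Za-z_][A-Za-z_0-9]*"),
        Pattern.compile("/"),
        Pattern.compile("/[^/]*/"),   // approximates REGEX_STRING; may span lines
        Pattern.compile("[=;]")
    };

    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < src.length()) {
            int best = -1;
            int bestLen = 0;
            // Try every rule at the current position and keep the longest match.
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(src).region(pos, src.length());
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    best = i;
                    bestLen = m.end() - pos;
                }
            }
            if (best < 0) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            if (best != 0) {  // whitespace is skipped
                out.add(NAMES[best] + "(" + src.substring(pos, pos + bestLen) + ")");
            }
            pos += bestLen;
        }
        return out;
    }
}
```

On the two-line input from the question, this sketch produces a single REGEX token covering the text from the first / to the second one, which corresponds to the extraneous input in the error message. With the first / removed, the remaining / has no closing partner, so it falls back to SLASH and the input lexes as expected.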

There's an example of solving this lexical ambiguity, which also shows up in parsing ECMAScript, in the ANTLR4 example directory. (That's a GitHub permalink so that the line numbers cited below continue to work.)

The basic strategy is to decide whether a / can start a regular expression based on the immediately previous token. This works in ECMAScript because the syntactic contexts in which an operator (such as / or /=) can appear are disjoint from the contexts in which an operand can appear. This will probably not translate directly into a Perl parser, but it might help show the possibilities.

Lines 780-782: The regex token itself is protected by a semantic guard:

RegularExpressionLiteral
 : {isRegexPossible()}? '/' RegularExpressionBody '/' RegularExpressionFlags
 ;

Lines 154-182: The guard function itself is simple, but obviously required a certain amount of grammatical analysis to generate the correct test. (Note: The list of tokens has been abbreviated; see the original file for the complete list):

    private boolean isRegexPossible() {
        if (this.lastToken == null) {
            // No token has been produced yet, so a regex literal may start here.
            return true;
        }

        switch (this.lastToken.getType()) {
            case Identifier:
            case NullLiteral:
            ...
                // After any of the tokens above, no regex literal can follow.
                return false;
            default:
                // In all other cases, a regex literal _is_ possible.
                return true;
        }
    }

Lines 127-147: For that to work, the scanner must retain the previous token in the member variable lastToken. (Comments removed for space):

    @Override
    public Token nextToken() {
        Token next = super.nextToken();
        if (next.getChannel() == Token.DEFAULT_CHANNEL) {
            this.lastToken = next;
        }
        return next;
    }
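To show how the pieces fit together on the Perl inputs from the question, here is a small self-contained Java sketch of the same previous-token strategy. It is a hand-written toy lexer, not ANTLR-generated code: the class name PreviousTokenLexer, the rule table, the token names (INT, VAR, BIND, OP) and the operand list inside isRegexPossible are all assumptions for illustration. A real Perl grammar would need a much longer and more carefully analysed list, and as noted above the approach may not carry over to Perl directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy lexer (hypothetical, not ANTLR): REGEX is gated on the previous token.
class PreviousTokenLexer {
    static final String[] NAMES = {"WS", "INT", "VAR", "BIND", "SLASH", "REGEX", "OP"};
    static final Pattern[] RULES = {
        Pattern.compile("\\s+"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("\\$[A-Za-z_][A-Za-z_0-9]*"),
        Pattern.compile("=~"),
        Pattern.compile("/"),
        Pattern.compile("/[^/]*/"),   // approximates REGEX_STRING
        Pattern.compile("[=;]")
    };

    // Assumption for this sketch: a '/' right after an operand is a division.
    static boolean isRegexPossible(String lastType) {
        return !("INT".equals(lastType) || "VAR".equals(lastType));
    }

    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        String lastType = null;  // type of the last non-whitespace token
        int pos = 0;
        while (pos < src.length()) {
            int best = -1;
            int bestLen = 0;
            for (int i = 0; i < RULES.length; i++) {
                // The guard: skip the REGEX rule where a regex cannot start.
                if (NAMES[i].equals("REGEX") && !isRegexPossible(lastType)) {
                    continue;
                }
                Matcher m = RULES[i].matcher(src).region(pos, src.length());
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    best = i;
                    bestLen = m.end() - pos;
                }
            }
            if (best < 0) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            if (best != 0) {  // whitespace is skipped and does not update lastType
                out.add(NAMES[best] + "(" + src.substring(pos, pos + bestLen) + ")");
                lastType = NAMES[best];
            }
            pos += bestLen;
        }
        return out;
    }
}
```

With this guard, both / characters in the two division statements come out as SLASH because each follows an INT, while the / after =~ starts a REGEX token. Skipping whitespace when updating lastType plays the same role as the DEFAULT_CHANNEL check in nextToken() above.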