Disambiguating a left-recursive ANTL4 rule-CodePudding

Let's consider this simple ANTL4 language grammar.

Lexer:

lexer grammar BiaLexer;

Lt                      : '<' ;
Gt                      : '>' ;
Identifier              : [a-zA-Z] ([a-zA-Z1-9] | ':')* ;
LeftParen               : '(' ;
RightParen              : ')' ;
Comma                   : ',' ;

Whitespace              : (' ' | '\n') -> skip ;

Parser:

parser grammar BiaParser;

options { tokenVocab = BiaLexer; }

typeExpression
    : referredName=Identifier # typeReference ;

expression
    : callee=expression callTypeVariableList LeftParen callArgumentList RightParen # callExpression
    | left=expression operator=Lt right=expression # binaryOperation
    | left=expression operator=Gt right=expression # binaryOperation
    | referredName=Identifier # reference
    | LeftParen expression RightParen # parenExpression ;

callTypeVariableList: Lt typeExpression (Comma typeExpression)* Gt ;

callArgumentList: (expression (Comma expression)*)? ;

So, basically, this language has only:

ordinary references, e.g. a
type references, e.g. A
comparisons, e.g. a < b or c > d
expressions wrapped in parenthesis, e.g. (a)
and, finally, generic function calls: e.g. f<A, B>(a, b) or f<A>(a) (similar to, let's say, Kotlin)

This grammar is ambiguous. A simple expression like f<A>(a) can be interpreted as...

...a generic call: Call(calle = ref:f, typeArgs = TypeArgs(typeRef:A), args = Args(ref:a))

...or a chain of comparisons between a reference, another reference and an parenthesised expression: Binary(op = >, left = Binary(op = <, left = ref:f, right = ref:A), right = Paren(ref:a))

The actual parser generated by ANTLR does the second, i.e. comparison chain. If I comment-out the binary operation rules...

//    | left=expression operator=Lt right=expression # binaryOperation
//    | left=expression operator=Gt right=expression # binaryOperation

...then the result is, as expected by me, the generic call.

Please note that I've, on purpose, put the #callExpression case on the top of the expression rule, with an intention of declaring that it has higher precedence than the comparison cases below. I believed that that's how one declares case precedence in ANTLR, but obviously it doesn't work in this case.

Questions:

why does ANTLR interpret f<A>(a) as a chain of comparisons?
how can I fix that, i.e. make the generic call have precedence over comparison chain?

If that matters, I can provide the code I've used to dump the AST to a pretty-string, but that's just a simple ANTLR visitor emitting a string. I've skipped it for readability.

CodePudding user response：

I took a look at the ANTLR grammars for Swift and Rust. Both of them allowed only for some sort of identifier to precede the generic type specification (i.e. they did not allow for any expression to be used as a callee).

Using that approach, something like this parses your input just fine:

grammar Bia
    ;

typeExpression: referredName = Identifier # typeReference;

expression
    : callee=Identifier callTypeVariableList LeftParen callArgumentList RightParen # callExpr
    | left = expression operator = (Lt | Gt) right = expression             # binaryExpression
    | Identifier                                                            # reference
    | LeftParen expression RightParen                                       # parenExpression
    ;

callTypeVariableList
    : Lt typeExpression (Comma typeExpression)* Gt
    ;

callArgumentList: (expression (Comma expression)*)?;

Lt:         '<';
Gt:         '>';
Identifier: [a-zA-Z] ([a-zA-Z1-9] | ':')*;
LeftParen:  '(';
RightParen: ')';
Comma:      ',';

Whitespace: (' ' | '\n') -> skip;

You might find that you want a rule that is a bit more flexible about the sort of callee identifiers you want to allow, without it being just ANY sort of expression (There's probably a good argument that the boolean result of a < or > expression couldn't really serve as a callee anyway).

The following allows for much more flexibility and still correctly matches your expression:

grammar Bia
    ;

typeExpression: referredName = Identifier # typeReference;

expression
    : callExpression                                            # callExpr
    | left = expression operator = (Lt | Gt) right = expression # binaryExpression
    | Identifier                                                # reference
    | LeftParen expression RightParen                           # parenExpression
    ;

callExpression
    : callee = calleeIdentifier (callTypeVariableList)? LeftParen callArgumentList RightParen # idCall
    | callee = callExpression (callTypeVariableList)? LeftParen callArgumentList RightParen # exprCall
    ;

callTypeVariableList
    : Lt typeExpression (Comma typeExpression)* Gt
    ;

calleeIdentifier: Identifier ('.' Identifier)*;

callArgumentList: (expression (Comma expression)*)?;

Lt:         '<';
Gt:         '>';
Identifier: [a-zA-Z] ([a-zA-Z1-9] | ':')*;
LeftParen:  '(';
RightParen: ')';
Comma:      ',';

Whitespace: (' ' | '\n') -> skip;

NOTE: I also tried kaby76's suggestion, and it does handle your situation. You might find the resulting context class a bit awkward though (as there will be a single rule alternative that matches either a call of an LT expression).

CodePudding user response：

To answer the first (my own) subquestion, I still don't know why doesn't the first #callExpression alternative "get picked" by ANTLR in the original grammar. This comment by kaby76 makes an educated guess that sounds reasonable.

Mike Cargal's answer solves the problem as described in the question very well. Building on top of it, I've adjusted the grammar so it also handles function call on parenthesised expressions (like (a)<A>(b) or (a b)<A>(c)).

The slight difference in the approach is that, in my case, I have a separate rule for a "callable expression" (an expression that can be called), not for the call expression itself. Still, as you can "call the call" in this adjusted grammar, the call expression is an alternative in this rule.

The modified parser grammar looks like this:

parser grammar BiaParser;

options { tokenVocab = BiaLexer; }

typeExpression
    : referredName=Identifier # typeReference ;

expression
    : callableExpression # callableExpression_
    | left=expression operator=Lt right=expression # binaryOperation
    | left=expression operator=Gt right=expression # binaryOperation ;

callableExpression
    : LeftParen expression RightParen # parenExpression
    | callee=callableExpression callTypeVariableList? LeftParen callArgumentList RightParen # callExpression
    | referredName=Identifier # reference ;

callTypeVariableList: Lt typeExpression (Comma typeExpression)* Gt ;

callArgumentList: (expression (Comma expression)*)? ;

It allows the following kind of generic calls:

call on identifier, e.g. f<A>(a)
call on any parenthesised expression
call on call, e.g. f<A>(a, b)<B, C>(c, d, e)

I've verified with tests that all the above cases are parsed as expected.

One interesting thing to note is that, as far as I can see, this adjusted grammar doesn't really limit the programmer in any way, compared to the original grammar. It's difficult to reason about what it would mean to call a < / > expression directly, as even the original grammar (by intention and ordering) considered a < b<B>(c) to be a lt-comparison of reference a and b<B>(c) generic call (and that's what most programmers would, probably, expect). This (probably) generalises to all kind of binary operator (e.g. , -) and possibly more kind of expressions appearing in general-purpose languages.