ANTLR: Why is this grammar rule for tuples not LL(1)?

I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:

Decision can match input such as "COMMA" using multiple alternatives: 1, 2

I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.

I am further confused by the discussion I found in this question: "Amend JSON - based grammar to allow for trailing comma"

And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307

Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.

options {k=1; backtrack=no;}

tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';

DIGIT  : '0'..'9' ;
LOWER  : 'a'..'z' ;
UPPER  : 'A'..'Z' ;

IDENT  : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;

edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?

Answer:

Note:

The question has been edited since this answer was written. In the original, the grammar had the line:

tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';

and that's what this answer is referring to.

That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).

My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:

tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';

That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.

LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.

You're right that the grammar you originally posted (the one ending in (IDENT)?) is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:

COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.

But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
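To make the conflict concrete, here is a rough hand-written Python sketch of what a one-token-lookahead parser for the corrected rule ends up doing. It is not ANTLR's generated code; the names tokenize and accepts_k1 are made up for illustration, and the sketch simply assumes that a COMMA always selects the (COMMA IDENT) loop, which is the effect of alternative 2 being disabled.

```python
import re

def tokenize(text):
    """Map text to token kinds; characters outside this toy alphabet are skipped."""
    kinds = []
    for lexeme in re.findall(r"[A-Za-z_][A-Za-z_0-9]*|[(),]", text):
        if lexeme == ",":
            kinds.append("COMMA")
        elif lexeme in ("(", ")"):
            kinds.append(lexeme)
        else:
            kinds.append("IDENT")
    return kinds

def accepts_k1(text):
    """tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')' with one token of lookahead."""
    tokens = tokenize(text) + ["EOF"]
    pos = 0

    def match(kind):
        nonlocal pos
        if tokens[pos] != kind:
            raise SyntaxError(f"expected {kind}, found {tokens[pos]}")
        pos += 1

    try:
        match("(")
        match("IDENT")
        # Assumption: seeing COMMA always predicts another loop iteration.
        while tokens[pos] == "COMMA":
            match("COMMA")
            match("IDENT")  # fails on a trailing comma, where ')' comes next
        # The optional trailing comma can never be chosen: the loop above
        # has already claimed every COMMA.
        if tokens[pos] == "COMMA":
            match("COMMA")
        match(")")
        match("EOF")
        return True
    except SyntaxError:
        return False
```

With this sketch, accepts_k1("(a,b)") succeeds, but accepts_k1("(a,)") fails, because the comma is committed to the loop before the parser can see that ')' follows.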

You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
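As an illustration of why two tokens are enough, here is a small hand-written Python sketch (again, not ANTLR output; tokenize and accepts_k2 are hypothetical names). The only change from a one-token parser is that the loop is entered only when COMMA is followed by IDENT.

```python
import re

def tokenize(text):
    """Map text to token kinds; characters outside this toy alphabet are skipped."""
    kinds = []
    for lexeme in re.findall(r"[A-Za-z_][A-Za-z_0-9]*|[(),]", text):
        if lexeme == ",":
            kinds.append("COMMA")
        elif lexeme in ("(", ")"):
            kinds.append(lexeme)
        else:
            kinds.append("IDENT")
    return kinds

def accepts_k2(text):
    """tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')' with two tokens of lookahead."""
    tokens = tokenize(text) + ["EOF"]
    pos = 0

    def match(kind):
        nonlocal pos
        if tokens[pos] != kind:
            raise SyntaxError(f"expected {kind}, found {tokens[pos]}")
        pos += 1

    try:
        match("(")
        match("IDENT")
        # Two-token prediction: COMMA IDENT means "another repetition".
        # tokens[pos + 1] is safe because COMMA is never the last token (EOF is).
        while tokens[pos] == "COMMA" and tokens[pos + 1] == "IDENT":
            match("COMMA")
            match("IDENT")
        # Any COMMA left here must be the optional trailing one.
        if tokens[pos] == "COMMA":
            match("COMMA")
        match(")")
        match("EOF")
        return True
    except SyntaxError:
        return False
```

Here accepts_k2 accepts (a), (a,), (a,b) and (a,b,), and still rejects (a,,) and (a b).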

You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.

But, for what it's worth, here's a possible solution:

tuple  : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;

Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.

It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.
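For comparison, the left-factored rules can be read as a recursive-descent parser in which every decision looks at just the next token. The following Python sketch is only an illustration under that reading (tokenize, accepts_left_factored and the allow_empty flag are made-up names, not anything Antlr generates); allow_empty=True corresponds to the '(' (idents)? ')' variant.

```python
import re

def tokenize(text):
    """Map text to token kinds; characters outside this toy alphabet are skipped."""
    kinds = []
    for lexeme in re.findall(r"[A-Za-z_][A-Za-z_0-9]*|[(),]", text):
        if lexeme == ",":
            kinds.append("COMMA")
        elif lexeme in ("(", ")"):
            kinds.append(lexeme)
        else:
            kinds.append("IDENT")
    return kinds

def accepts_left_factored(text, allow_empty=False):
    """tuple : '(' idents ')' ;  idents : IDENT ( COMMA ( idents )? )? ;"""
    tokens = tokenize(text) + ["EOF"]
    pos = 0

    def match(kind):
        nonlocal pos
        if tokens[pos] != kind:
            raise SyntaxError(f"expected {kind}, found {tokens[pos]}")
        pos += 1

    def idents():
        match("IDENT")
        if tokens[pos] == "COMMA":      # one token decides: is there a comma?
            match("COMMA")
            if tokens[pos] == "IDENT":  # one token decides: more idents or stop
                idents()

    try:
        match("(")
        if allow_empty:
            # tuple : '(' (idents)? ')' ;
            if tokens[pos] == "IDENT":
                idents()
        else:
            idents()
        match(")")
        match("EOF")
        return True
    except SyntaxError:
        return False
```

The comma is consumed before the parser has to decide whether another identifier follows, so no decision ever needs a second token of lookahead.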