Antlr4: Can't understand why breaking something out into a subrule doesn't work

I'm still new at Antlr4, and I have what is probably a really stupid problem.

Here's a fragment from my .g4 file:

assignStatement
    : VariableName '=' expression ';'
    ;

expression
    :   (value | VariableName)
        | bin_op='(' expression ')'
        | expression UNARY_PRE_OR_POST
        | (UNARY_PRE_OR_POST | '+' | '-' | '!' | '~' | type_cast) expression
        | expression MUL_DIV_MOD expression
        | expression ADD_SUB expression
    ;

VariableName
    : ( [a-z] [A-Za-z0-9_]* )
    ;

// Pre or post increment/decrement
UNARY_PRE_OR_POST
    : '++' | '--'
    ;

// multiply, divide, modulus
MUL_DIV_MOD
    : '*' | '/' | '%'
    ;

// Add, subtract
ADD_SUB
    : '+' | '-'
    ;

And my sample input:

myInt = 10 + 5;
myInt = 10 - 5;
myInt = 1 + 2 + 3;
myInt = 1 + (2 + 3);
myInt = 1 + 2 * 3;
myInt = ++yourInt;
yourInt = (10 - 5)--;

The first sample line, myInt = 10 + 5;, produces this error:

line 22:11 mismatched input '+' expecting ';'
line 22:14 extraneous input ';' expecting {<EOF>, 'class', '{', 'interface', 'import', 'print', '[', '_', ClassName, VariableName, LITERAL, STRING, NUMBER, NUMERIC_LITERAL, SYMBOL}

I get similar issues with each of the lines.

If I make one change, a whole bunch of errors disappear:

        | expression ADD_SUB expression

change it to this:

        | expression ('+' | '-') expression

I've tried a bunch of things. I've tried using both lexer and parser rules (that is, calling it add_sub or ADD_SUB). I've tried a variety of combinations of parentheses.

I tried:

ADD_SUB: [+-];

What's annoying is that the pre- and post-increment lines produce no errors as long as I don't have errors due to +-*. Yet they rely on UNARY_PRE_OR_POST. Of course, maybe it's not really using that and it's using something else that just isn't clear to me.

For now, I'm just eliminating the subrule syntax and will embed everything in the main rule. But I'd like to understand what's going on.

So... what is the proper way to do this?

CodePudding user response：

Do not use literal tokens inside parser rules (unless you know what you're doing).

For the grammar:

expression
    :   '+' expression
    |   ...
    ;

ADD_SUB
    : '+' | '-'
    ;

ANTLR will create a lexer rule for the literal '+', making the grammar really look like this:

expression
    :   T__0 expression
    |   ...
    ;

T__0 : '+';

ADD_SUB
    : '+' | '-'
    ;

causing the input + to never become an ADD_SUB token, because T__0 will always match it first. That is simply how the lexer operates: it tries to match as many characters as possible for every lexer rule, and when two (or more) rules match the same number of characters, the one defined first "wins".
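
The matching rule described above can be imitated with a toy lexer in Python. This is only a sketch of the "longest match, then first defined" behaviour, not ANTLR itself. The rule lists are assumptions modelled on the question's grammar: T__0 and T__1 stand for the implicit tokens created from the literals '+' and '-' in the parser rule, and NUMBER and WS are hypothetical helper rules added so the sample input can be tokenized.

```python
import re

def tokenize(text, rules):
    """Toy lexer: the longest match wins; on a tie, the earliest rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # whitespace is skipped (assumed)
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Assumed effective rule order for the question's grammar:
# implicit literal tokens first, then the named lexer rules.
BROKEN = [
    ("T__0", r"\+"),
    ("T__1", r"-"),
    ("UNARY_PRE_OR_POST", r"\+\+|--"),
    ("MUL_DIV_MOD", r"[*/%]"),
    ("ADD_SUB", r"[+-]"),
    ("NUMBER", r"[0-9]+"),   # hypothetical helper rule
    ("WS", r"\s+"),          # hypothetical helper rule
]

# The answer's approach: one named token per operator, no competing literals.
FIXED = [
    ("UNARY_PRE_OR_POST", r"\+\+|--"),
    ("MUL", r"\*"),
    ("ADD", r"\+"),
    ("SUB", r"-"),
    ("NUMBER", r"[0-9]+"),
    ("WS", r"\s+"),
]

print(tokenize("10 + 5", BROKEN))  # [('NUMBER', '10'), ('T__0', '+'), ('NUMBER', '5')]
print(tokenize("5--", BROKEN))     # [('NUMBER', '5'), ('UNARY_PRE_OR_POST', '--')]
print(tokenize("10 + 5", FIXED))   # [('NUMBER', '10'), ('ADD', '+'), ('NUMBER', '5')]
```

In the BROKEN list, a lone + is never an ADD_SUB token because T__0 ties on length and comes first, so the parser alternative that expects ADD_SUB cannot match. The -- case still works because a two-character match beats the one-character T__1, which would explain why the increment and decrement lines in the question produced no errors.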

Do something like this instead:

expression
    : value
    | '(' expression ')'
    | expression UNARY_PRE_OR_POST
    | (UNARY_PRE_OR_POST | ADD | SUB | EXCL | TILDE | type_cast) expression
    | expression (MUL | DIV | MOD) expression
    | expression (ADD | SUB) expression
    ;

value
    : ...
    | VariableName
    ;

VariableName
    : [a-z] [A-Za-z0-9_]*
    ;

UNARY_PRE_OR_POST
    : '++' | '--'
    ;

MUL : '*';
DIV : '/';
MOD : '%';
ADD : '+';
SUB : '-';
EXCL : '!';
TILDE : '~';