Semicolon in a string causes problems

I'm having some issues allowing semicolons in strings in my ANTLR4 grammar.

My grammar should accept this:

prop_name@Default:'Building 3;100'

My grammar looks like this:

grammar BoitFilter;

filter : ';'* expression ( ';'  expression )* ';'*;

expression : field boitOperator boitValueExpression;

field : ( parent '.' )? field_name;

parent : IDENTIFIER;

field_name : IDENTIFIER;

IDENTIFIER : [a-zA-Z0-9_@\[\]\.]+ ;

boitInOperator : ':(' boitValueExpression ( ','  boitValueExpression )* ')';

boitOperator : ( ':' | '<' | '>' | '<:' | '>:' | boitInOperator );

boitValueExpression : QUOTE boitValue QUOTE;

boitValue : VALUE_STRING_CHARACTER ;

VALUE_STRING_CHARACTER : [\ \:\;åäöÅÄÖa-zA-Z_0-9\*\-];

QUOTE : '\'';

I suspect my VALUE_STRING_CHARACTER rule is wrong, but I'm not sure why.

In my Java code I have a listener for boitValue:

@Override
public void enterBoitValue(BoitFilterParser.BoitValueContext ctx) {
    String textValue = ctx.getText();
    // Do something with the text
}

Here, I expect the textValue variable to be "'Building 3;100'", but instead its value is "'Building 3<missing '''>".

It seems like my grammar fails to accept the semicolon as part of the string.

Any idea what I might be doing wrong?

CodePudding user response:

First: don't use literal tokens inside your parser rules (unless you know exactly what ANTLR does with them). ANTLR turns each literal into an implicit lexer rule, so your grammar is interpreted as follows:

grammar BoitFilter;

filter              : SEMI* expression ( SEMI  expression )* SEMI*;
expression          : field boitOperator boitValueExpression;
field               : ( parent DOT )? field_name;
parent              : IDENTIFIER;
field_name          : IDENTIFIER;
boitInOperator      : COL_O_PAR boitValueExpression ( COMMA  boitValueExpression )* C_PAR;
boitOperator        : ( COL | LT | GT | LT_COL | GT_COL | boitInOperator );
boitValueExpression : QUOTE boitValue QUOTE;
boitValue           : VALUE_STRING_CHARACTER ;

SEMI                   : ';';
DOT                    : '.';
COL_O_PAR              : ':(';
COMMA                  : ',';
C_PAR                  : ')';
COL                    : ':';
LT                     : '<';
GT                     : '>';
LT_COL                 : '<:';
GT_COL                 : '>:';
IDENTIFIER             : [a-zA-Z0-9_@[\].]+ ;
VALUE_STRING_CHARACTER : [ :;åäöÅÄÖa-zA-Z_0-9*\-];
QUOTE                  : '\'';

Of course, ANTLR will not name them as I did above: it will name them T__0 through T__9 instead, but that doesn't matter.

And if you now check what tokens are being created for your input prop_name@Default:'Building 3;100':

String source = "prop_name@Default:'Building 3;100'";
BoitFilterLexer lexer  = new BoitFilterLexer(CharStreams.fromString(source));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();

for (Token t : stream.getTokens()) {
  System.out.printf("%-25s'%s'\n", BoitFilterLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}

you'll see this being printed:

IDENTIFIER               'prop_name@Default'
COL                      ':'
QUOTE                    '''
IDENTIFIER               'Building'
VALUE_STRING_CHARACTER   ' '
IDENTIFIER               '3'
SEMI                     ';'
IDENTIFIER               '100'
QUOTE                    '''
EOF                      '<EOF>'

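As you can see, the parser cannot create a boitValueExpression from 'Building 3;100': the text between the quotes is tokenized as a mix of IDENTIFIER, VALUE_STRING_CHARACTER and SEMI tokens. This is because of how ANTLR's lexer works:

1. it tries to match as many characters as possible
2. when two or more lexer rules match the same number of characters, the rule defined first wins

Because of rule 1, Building becomes a single IDENTIFIER token instead of eight VALUE_STRING_CHARACTER tokens. Because of rule 2, the input 3 will always become an IDENTIFIER and never a VALUE_STRING_CHARACTER token. For the same reason, ; becomes a SEMI token, since the literal ';' is defined before VALUE_STRING_CHARACTER.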
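If it helps to see the two rules in isolation, here is a rough sketch in plain Java that imitates them with java.util.regex. It is not how ANTLR is implemented, and the class name LexerRuleDemo and its tokenize method are made up for this illustration. The patterns mirror the lexer rules listed above, in the same order, assuming IDENTIFIER matches one or more characters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: imitates "longest match, then first rule wins".
class LexerRuleDemo {
    // Lexer rules in definition order; LinkedHashMap preserves that order.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("SEMI", Pattern.compile(";"));
        RULES.put("DOT", Pattern.compile("\\."));
        RULES.put("COL_O_PAR", Pattern.compile(":\\("));
        RULES.put("COMMA", Pattern.compile(","));
        RULES.put("C_PAR", Pattern.compile("\\)"));
        RULES.put("COL", Pattern.compile(":"));
        RULES.put("LT", Pattern.compile("<"));
        RULES.put("GT", Pattern.compile(">"));
        RULES.put("LT_COL", Pattern.compile("<:"));
        RULES.put("GT_COL", Pattern.compile(">:"));
        RULES.put("IDENTIFIER", Pattern.compile("[a-zA-Z0-9_@\\[\\].]+"));
        // The \\u escapes stand for the letters åäöÅÄÖ from the grammar.
        RULES.put("VALUE_STRING_CHARACTER",
                Pattern.compile("[ :;\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6a-zA-Z_0-9*\\-]"));
        RULES.put("QUOTE", Pattern.compile("'"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Rule 1: a strictly longer match replaces the current best.
                // Rule 2: on a tie the earlier rule stays, because of the '>'.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            tokens.add(bestName + " '" + input.substring(pos, bestEnd) + "'");
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("prop_name@Default:'Building 3;100'")) {
            System.out.println(token);
        }
    }
}
```

Running main on the example input should list the same token types as the ANTLR dump above (without the EOF token): the 3 ends up as IDENTIFIER and the ; as SEMI, because both tie with VALUE_STRING_CHARACTER on length and are defined earlier.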

So if you want boitValue to also match semicolons and characters that are in the IDENTIFIER rule, you'll need to do that like this:

boitValue : ( VALUE_STRING_CHARACTER | IDENTIFIER | SEMI )+ ;

which will then parse your example input like this: