I'm having some issues allowing semicolons in strings in my ANTLR4 grammar.
My grammar should accept this:
prop_name@Default:'Building 3;100'
My grammar looks like this:
grammar BoitFilter;
filter : ';'* expression ( ';' expression )* ';'*;
expression : field boitOperator boitValueExpression;
field : ( parent '.' )? field_name;
parent : IDENTIFIER;
field_name : IDENTIFIER;
IDENTIFIER : [a-zA-Z0-9_@\[\]\.]+;
boitInOperator : ':(' boitValueExpression ( ',' boitValueExpression )* ')';
boitOperator : ( ':' | '<' | '>' | '<:' | '>:' | boitInOperator );
boitValueExpression : QUOTE boitValue QUOTE;
boitValue : VALUE_STRING_CHARACTER+;
VALUE_STRING_CHARACTER : [\ \:\;åäöÅÄÖa-zA-Z_0-9\*\-];
QUOTE : '\'';
I think my VALUE_STRING_CHARACTER rule may be wrong, but I'm not sure why.
In my Java code I have a listener for boitValue:
@Override
public void enterBoitValue(BoitFilterParser.BoitValueContext ctx) {
String textValue = ctx.getText();
// Do something with the text
}
Here, I expect the textValue variable to be "'Building 3;100'", but instead its value is "'Building 3<missing '''>".
It seems like my grammar fails to accept the semicolon as part of the string.
Any idea what I might be doing wrong?
CodePudding user response:
First: don't use literal tokens inside your parser rules (unless you know exactly what they do). Your grammar is interpreted by ANTLR as follows:
grammar BoitFilter;
filter : SEMI* expression ( SEMI expression )* SEMI*;
expression : field boitOperator boitValueExpression;
field : ( parent DOT )? field_name;
parent : IDENTIFIER;
field_name : IDENTIFIER;
boitInOperator : COL_O_PAR boitValueExpression ( COMMA boitValueExpression )* C_PAR;
boitOperator : ( COL | LT | GT | LT_COL | GT_COL | boitInOperator );
boitValueExpression : QUOTE boitValue QUOTE;
boitValue : VALUE_STRING_CHARACTER+;
SEMI : ';';
DOT : '.';
COL_O_PAR : ':(';
COMMA : ',';
C_PAR : ')';
COL : ':';
LT : '<';
GT : '>';
LT_COL : '<:';
GT_COL : '>:';
IDENTIFIER : [a-zA-Z0-9_@[\].]+;
VALUE_STRING_CHARACTER : [ :;åäöÅÄÖa-zA-Z_0-9*\-];
QUOTE : '\'';
Of course, ANTLR will not name them like I did above. It will name the ten literal tokens T__0 ... T__9 instead, but that doesn't matter.
And if you now check what tokens are being created for your input prop_name@Default:'Building 3;100':
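import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

// Tokenize the example input and print each token's type name and text
String source = "prop_name@Default:'Building 3;100'";
BoitFilterLexer lexer = new BoitFilterLexer(CharStreams.fromString(source));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
    System.out.printf("%-25s'%s'\n", BoitFilterLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}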
you'll see this being printed:
IDENTIFIER               'prop_name@Default'
COL                      ':'
QUOTE                    '''
IDENTIFIER               'Building'
VALUE_STRING_CHARACTER   ' '
IDENTIFIER               '3'
SEMI                     ';'
IDENTIFIER               '100'
QUOTE                    '''
EOF                      '<EOF>'
As you can see, the parser cannot create a boitValueExpression from 'Building 3;100'. This is because of how ANTLR's lexer works:
1. it tries to match as many characters as possible
2. when two or more lexer rules match the same number of characters, the one defined first "wins"
Because of rule 2, the input 3 will always become an IDENTIFIER and never a VALUE_STRING_CHARACTER token.
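The two rules above can be imitated in a few lines of plain Java. This is only a sketch: it uses java.util.regex instead of ANTLR, covers just the lexer rules the example input touches, and the class and method names (MaximalMunchDemo, tokenize) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunchDemo {

    // Rules in definition order: the literal tokens come first, then the named
    // lexer rules. Only the rules needed for the example input are listed.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("SEMI", Pattern.compile(";"));
        RULES.put("COL", Pattern.compile(":"));
        RULES.put("IDENTIFIER", Pattern.compile("[a-zA-Z0-9_@\\[\\].]+"));
        // The \u escapes are the letters å ä ö Å Ä Ö from the grammar.
        RULES.put("VALUE_STRING_CHARACTER",
                Pattern.compile("[ :;\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6a-zA-Z_0-9*\\-]"));
        RULES.put("QUOTE", Pattern.compile("'"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLength = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // A strictly longer match wins (rule 1). On a tie the rule that
                // was defined earlier is kept (rule 2).
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestRule = rule.getKey();
                    bestLength = m.end() - pos;
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            tokens.add(bestRule + " '" + input.substring(pos, pos + bestLength) + "'");
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("prop_name@Default:'Building 3;100'")) {
            System.out.println(token);
        }
    }
}
```

Moving VALUE_STRING_CHARACTER above IDENTIFIER in the map would turn 3 into a VALUE_STRING_CHARACTER, but Building and 100 would still be IDENTIFIER tokens because the longer match takes priority.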
So if you want boitValue to also match semicolons and characters that are in the IDENTIFIER rule, you'll need to do that like this:
boitValue : ( VALUE_STRING_CHARACTER | IDENTIFIER | SEMI )+;
which will then parse your example input like this:
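In outline, every token between the two QUOTE tokens ends up under boitValue. A minimal sketch of that grouping in plain Java (no ANTLR runtime involved; the class and method names are hypothetical, and the token list is copied from the dump above), assuming the alternatives may repeat one or more times:

```java
import java.util.Arrays;
import java.util.List;

public class BoitValueGrouping {

    // Each entry is {token type, token text}, taken from the token dump above.
    static List<String[]> exampleTokens() {
        return Arrays.asList(
                new String[] {"IDENTIFIER", "prop_name@Default"},
                new String[] {"COL", ":"},
                new String[] {"QUOTE", "'"},
                new String[] {"IDENTIFIER", "Building"},
                new String[] {"VALUE_STRING_CHARACTER", " "},
                new String[] {"IDENTIFIER", "3"},
                new String[] {"SEMI", ";"},
                new String[] {"IDENTIFIER", "100"},
                new String[] {"QUOTE", "'"});
    }

    // Collects the text of the tokens between the first pair of QUOTE tokens,
    // accepting only the three token types that the relaxed boitValue allows.
    static String boitValueText(List<String[]> tokens) {
        StringBuilder value = new StringBuilder();
        boolean inside = false;
        for (String[] token : tokens) {
            String type = token[0];
            if (type.equals("QUOTE")) {
                if (inside) {
                    if (value.length() == 0) {
                        throw new IllegalStateException("boitValue needs at least one token");
                    }
                    return value.toString(); // the closing quote ends the value
                }
                inside = true;
            } else if (inside) {
                boolean allowed = type.equals("VALUE_STRING_CHARACTER")
                        || type.equals("IDENTIFIER")
                        || type.equals("SEMI");
                if (!allowed) {
                    throw new IllegalStateException("Unexpected token in value: " + type);
                }
                value.append(token[1]);
            }
        }
        throw new IllegalStateException("Missing closing quote");
    }

    public static void main(String[] args) {
        System.out.println(boitValueText(exampleTokens())); // prints Building 3;100
    }
}
```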