I found an odd behavior in Antlr4 (I tried versions 4.10 and 4.10.1 with the same result).
When I try to use the grammar
grammar Paths;
cfg: NL? (entry (NL | EOF))* EOF;
entry: path ':' value;
path: SEGMENT ('.' SEGMENT)*;
value: USTRING;
SEGMENT: [a-zA-Z0-9]+ ;
USTRING: [a-zA-Z0-9]+ ;
NL: [\n\r] ;
WS: [ \t] -> skip;
on the string "key1:value1\nkey2.sub:value2\nkey3.sub1.sub2:value3", I see error messages:
line 1:5 mismatched input 'value1' expecting {':', '.'}
line 2:9 mismatched input 'value2' expecting {':', '.'}
line 3:15 mismatched input 'value3' expecting {':', '.'}
If I replace the value definition with value: SEGMENT, everything works as expected.
What is wrong in the first definition?
The output of tree in both cases is the same:
(cfg (entry (path key1) : (value value1)) \n (entry (path key2 . sub) : (value value2)) \n (entry (path key3 . sub1 . sub2) : (value value3)) <EOF> <EOF>)
I tried to simplify the grammar:
grammar Paths;
cfg: NL? (entry (NL | EOF))* EOF;
entry: path ':' value;
path: SEGMENT;
value: USTRING;
SEGMENT: [a-zA-Z0-9]+ ;
USTRING: [a-zA-Z0-9]+ ;
NL: [\n\r] ;
WS: [ \t] -> skip;
In this case I get errors (the parsed string is "key1:value1\nkey2:value2\nkey3:value3"):
line 1:5 mismatched input 'value1' expecting USTRING
line 2:5 mismatched input 'value2' expecting USTRING
line 3:5 mismatched input 'value3' expecting USTRING
And again everything is fine if I replace USTRING with SEGMENT in the value definition.
The output is:
(cfg (entry (path key1) : (value value1)) \n (entry (path key2) : (value value2)) \n (entry (path key3) : (value value3)) <EOF> <EOF>)
CodePudding user response:
It happens because ANTLR assigns the type SEGMENT to the values. The lexer runs independently of the parser rules: when the same text can match several lexer rules, the lexer gives it a single type, the one whose rule is defined first in the grammar.
This piece of code helped me:
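import org.antlr.v4.runtime.{CharStreams, CommonTokenStream}
import scala.jdk.CollectionConverters._  // Scala 2.13+

// PathsLexer is the lexer class ANTLR generates from the Paths grammar.
val lexer = new PathsLexer(CharStreams.fromString("key1:value1\nkey2.sub:value2\nkey3.sub1.sub2:value3"))
val tokens = new CommonTokenStream(lexer)
tokens.fill()  // read all tokens up to EOF
// Print each token with its numeric type and the display name of that type.
println(tokens.getTokens.asScala.map { token =>
  s"$token: ${token.getType} -> ${lexer.getVocabulary.getDisplayName(token.getType)}"
}.mkString("\n"))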
Probably I need to learn more about Antlr modes.
CodePudding user response:
That’s because ANTLR’s Lexer processes the input stream of characters to produce a token stream, and the parser processes the token stream. The recursive descent parsing in ANTLR is processing the token stream and has no impact on how the Lexer views input.
The Lexer rules SEGMENT and USTRING are identical, so both rules match exactly the same run of input characters. When that happens, ANTLR picks the rule that is defined first, so they'll all be SEGMENT tokens.
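The tie-break described here can be illustrated with a small, dependency-free Scala sketch. It is only a toy model and not ANTLR's actual implementation: the `LexerTieBreak` object and its `tokenize` helper are hypothetical names, the rule list is a hand-written approximation of the question's grammar (assuming the `+` quantifier on SEGMENT and USTRING), and whitespace skipping is left out.

```scala
// Toy model of lexer rule selection: the longest match wins,
// and ties go to the rule listed first. Not ANTLR's real algorithm.
object LexerTieBreak {
  // Rules in grammar order; the names mirror the question's grammar.
  val rules: List[(String, scala.util.matching.Regex)] = List(
    "':'"     -> ":".r,
    "'.'"     -> "\\.".r,
    "SEGMENT" -> "[a-zA-Z0-9]+".r,
    "USTRING" -> "[a-zA-Z0-9]+".r,
    "NL"      -> "[\\n\\r]+".r
  )

  // Returns (rule name, matched text) pairs for the whole input.
  def tokenize(input: String): List[(String, String)] = {
    val out = List.newBuilder[(String, String)]
    var pos = 0
    while (pos < input.length) {
      val rest = input.substring(pos)
      // Every rule that matches at the current position, in grammar order.
      val candidates = rules.flatMap { case (name, re) =>
        re.findPrefixOf(rest).map(text => (name, text))
      }
      require(candidates.nonEmpty, s"no rule matches at offset $pos")
      // Replace the current best only on a strictly longer match,
      // so an earlier rule keeps the token when lengths are equal.
      val best = candidates.reduceLeft((a, b) => if (b._2.length > a._2.length) b else a)
      out += best
      pos += best._2.length
    }
    out.result()
  }
}
```

Calling `LexerTieBreak.tokenize("key1:value1\n")` in this model never yields a USTRING token, because SEGMENT matches the same text and is listed first. Swapping the two entries in `rules` would flip the result, which mirrors why the order of lexer rules matters.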
If you've run through the standard setup (and created the grun alias), you can run it with the -tokens option to get a dump of your token stream. This is generally a good idea for validating that your Lexer rules are creating the token stream you expect.