How to match [BOF] "Begin of file" in Antlr4 Lexer?

In an Antlr4 grammar, I need comments (// xxxx) to always begin at the start of a line. The following grammar works fine for most cases.

grammar com;

comment: COMMENT;

COMMENT
 : '\n' '//' .*? '\n'
 ;

By design, it will match \n//comment\n but not //comment\n. But I also want it to match <BOF>//comment\n. How can I implement it?

CodePudding user response：

That is not possible in a language-agnostic way. You will have to add target-specific code to your grammar and use a predicate to check whether the char position in the line is 0:

COMMENT
 : {getCharPositionInLine() == 0}? '//' ~[\r\n]*
 ;

OTHER
 : .
 ;

If you now tokenize the input:

// start
// middle
?//...
// end

with the Java code:

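import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

public class Main {
    public static void main(String[] args) {
        String input = "// start\n// middle\n?//...\n// end";
        // comLexer is the lexer class ANTLR generates from the grammar named "com"
        comLexer lexer = new comLexer(CharStreams.fromString(input));
        CommonTokenStream stream = new CommonTokenStream(lexer);
        stream.fill(); // tokenize the whole input up front

        for (Token t : stream.getTokens()) {
            System.out.printf("%-10s'%s'%n",
                comLexer.VOCABULARY.getSymbolicName(t.getType()),
                t.getText().replace("\n", "\\n"));
        }
    }
}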

the following will be printed to your console:

COMMENT   '// start'
OTHER     '\n'
COMMENT   '// middle'
OTHER     '\n'
OTHER     '?'
OTHER     '/'
OTHER     '/'
OTHER     '.'
OTHER     '.'
OTHER     '.'
OTHER     '\n'
COMMENT   '// end'
EOF       '<EOF>'

Note that I also removed the \n at the end of the COMMENT, otherwise a comment at the end of the input would not be matched.
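
To make the effect of the predicate easier to see without running ANTLR, here is a small hand-written sketch in plain Java. It is not ANTLR-generated code and not how the generated lexer is implemented; the class name ColumnZeroCommentScanner and the string-based token format are made up for illustration. It only mimics the two rules above: "//" starts a COMMENT when the current column is 0, and every other character becomes an OTHER token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: mimics the COMMENT/OTHER rules by hand.
final class ColumnZeroCommentScanner {

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int column = 0; // plays the role of getCharPositionInLine()

        while (i < input.length()) {
            if (column == 0 && input.startsWith("//", i)) {
                // Same idea as: {getCharPositionInLine() == 0}? '//' ~[\r\n]*
                int end = i;
                while (end < input.length()
                        && input.charAt(end) != '\n'
                        && input.charAt(end) != '\r') {
                    end++;
                }
                tokens.add("COMMENT '" + input.substring(i, end) + "'");
                column += end - i;
                i = end;
            } else {
                // Same idea as: OTHER : . ;
                char c = input.charAt(i);
                tokens.add("OTHER '" + (c == '\n' ? "\\n" : String.valueOf(c)) + "'");
                column = (c == '\n') ? 0 : column + 1; // a newline resets the column
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        scan("// start\n// middle\n?//...\n// end").forEach(System.out::println);
    }
}
```

Because the column is 0 both at the very start of the input and right after a newline, the begin-of-file case needs no special handling. For the sample input, this sketch should yield the same sequence of COMMENT and OTHER tokens as the listing above, minus the EOF token.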

EDIT

"How can I do it with JavaScript? I cannot find good examples on the internet."

By looking at the JavaScript source, it looks like {this.column === 0}? is the JavaScript equivalent of {getCharPositionInLine() == 0}? (in that runtime, column appears to be a property of the lexer rather than a method).

"By the way, does the IntelliJ plugin support predicates? If it does, does it support only Java?"

No, the IntelliJ plugin ignores predicates. After all, the code inside a predicate can be any arbitrary chunk of code, making it quite hard to support.

CodePudding user response：

You may find that this requirement is better handled after parsing, in a semantic validation pass over your parse tree. (NOTE: A parser is not required to recognize ONLY valid input; it just has to correctly interpret input that can be understood in only one way.)

For example, does "// might be a comment" have some other, alternate interpretation if it's not at the beginning of the line?

If not, I would probably just accept the // comment ...\n as a token regardless of its position in the line.

Then, once you have the parse tree, you can check that your comments always have a column of 0. Doing it this way, your grammar is not tied to a particular target language, and, perhaps more importantly, you can give a "nice" error message like "Comments must begin in the first column of a line".

If you try to handle this in the lexer (or parser), then, if the comment is NOT in the correct column, you'll get a much more obscure recognition error that will be more difficult for users to understand.
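
The validation pass suggested in this answer could look roughly like the sketch below. It is written in plain Java so that it runs without the ANTLR runtime: the nested Tok class is a hypothetical stand-in for org.antlr.v4.runtime.Token, and the COMMENT and OTHER constants are assumed values (a generated lexer would provide its own). With real ANTLR tokens, the type, line, and column would come from getType(), getLine(), and getCharPositionInLine().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a post-lexing/post-parsing column check.
final class CommentColumnCheck {

    // Assumed token type values; a generated lexer defines the real ones.
    static final int COMMENT = 1;
    static final int OTHER = 2;

    // Stand-in for the parts of an ANTLR Token that this check needs.
    static final class Tok {
        final int type;
        final int line;   // 1-based line number
        final int column; // 0-based column, like getCharPositionInLine()
        final String text;

        Tok(int type, int line, int column, String text) {
            this.type = type;
            this.line = line;
            this.column = column;
            this.text = text;
        }
    }

    // Returns one readable message per comment that does not start in column 0.
    static List<String> validate(List<Tok> tokens) {
        List<String> errors = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.type == COMMENT && t.column != 0) {
                errors.add(String.format(
                    "line %d:%d Comments must begin in the first column of a line",
                    t.line, t.column));
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
            new Tok(COMMENT, 1, 0, "// start"),
            new Tok(OTHER, 2, 0, "?"),
            new Tok(COMMENT, 2, 1, "//..."));
        validate(tokens).forEach(System.out::println);
    }
}
```

The grammar itself then stays free of target-specific predicates, and the wording of the error message is entirely under your control.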