Include comments in the MATLAB grammar using ANTLR4

Could anybody help me with these two problems, please?

The first one is almost solved for me by the question "regular expression for multiline commentary in matlab", but I do not know exactly how I should use ^.*%\{(?:\R(?!.*%\{).*)*\R\h*%\}$, or where in the grammar it belongs, if I want to use it with ANTLR4. I have been using the MATLAB grammar from this source.

The second one is related to another type of comment in MATLAB, such as a = 3 % type any ascii I want.... In this case it worked when I inserted an extra alternative into the unary_expression rule, in this form:

unary_expression
: postfix_expression
| unary_operator postfix_expression
| postfix_expression COMMENT
;

where COMMENT: '%' [ a-zA-Z0-9]*;. But when I use [\x00-\x7F] instead of [ a-zA-Z0-9]* (which I found here), parsing goes wrong; see the example below:

INPUT FOR PARSER: a = 3 %  $£ K JFKL£J"!"OIJ 2432 3K3KJ£$K M£"Kdsa
ANTLR OUTPUT : Exception in thread "main" java.lang.RuntimeException: set is empty
               at org.antlr.v4.runtime.misc.IntervalSet.getMaxElement(IntervalSet.java:421)
               at org.antlr.v4.runtime.atn.ATNSerializer.serialize(ATNSerializer.java:169)
               at org.antlr.v4.runtime.atn.ATNSerializer.getSerialized(ATNSerializer.java:601)
               at org.antlr.v4.Tool.generateInterpreterData(Tool.java:745)
               at org.antlr.v4.Tool.processNonCombinedGrammar(Tool.java:400)
               at org.antlr.v4.Tool.process(Tool.java:361)
               at org.antlr.v4.Tool.processGrammarsOnCommandLine(Tool.java:328)
               at org.antlr.v4.Tool.main(Tool.java:172)
               line 1:9 token recognition error at: '$'
               line 1:20 token recognition error at: '"'
               line 1:21 token recognition error at: '!'
               line 1:22 token recognition error at: '"'
               line 1:38 token recognition error at: '$'
               line 1:43 token recognition error at: '"'
               line 1:10 missing {',', ';', CR} at 'L'
               line 1:32 missing {',', ';', CR} at '3'

Can anybody please tell me what I have done wrong? And what is the best practice for this problem? (I am not exactly a regex person...)

CodePudding user response：

Let's take the simple one first.

This looks (to me) like a typical "comment everything through the end of the line" comment.

Assuming I'm correct, it's best not to enumerate all the valid characters a comment might contain, but rather to think about what not to consume.

Try: COMMENT: '%' ~[\r\n]* '\r'? '\n';

(I notice that you did not include anything in your rule to terminate it at the end of the line, so I've added that).

This basically says: once I see a %, consume everything that is not a \r or \n, and stop when you see an optional \r followed by a required \n.
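As for the [\x00-\x7F] attempt in the question: as far as I know, ANTLR4 character sets do not accept \xNN escapes, and code points are written as \uXXXX instead, which would explain the "set is empty" error from the tool. If an explicit ASCII range were really wanted, a sketch could look like this (ASCII_COMMENT is a hypothetical rule name, used only for illustration):

```antlr
// Sketch only: an explicit ASCII range written with ANTLR's \uXXXX escape.
// This range includes \r and \n, so the rule would run past the end of the
// line, and it still stops at non-ASCII characters such as '£'.
ASCII_COMMENT: '%' [\u0000-\u007F]*;
```

Both drawbacks noted in the comments are why the negated set ~[\r\n] is usually the better fit for this kind of comment.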

Generally, comments can occur just about anywhere within a grammar structure, so it's VERY useful to "shove them off to the side" rather than inject them everywhere you allow them in the grammar.

So, a short grammar:

grammar test
    ;

test: ID EQ INT;

EQ:      '=';
INT:     [0-9] ;
COMMENT: '%' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN);
ID:      [a-zA-Z] ;
WS:      [ \t\r\n]  -> skip;

You'll notice that I removed the COMMENT element from the test rule.

test file:

a = 3 %  $£ K JFKL£J"!"OIJ 2432 3K3KJ£$K M£"Kdsa

(be sure to include the \n)

➜ grun test test -tree -tokens < test.txt
[@0,0:0='a',<ID>,1:0]
[@1,2:2='=',<'='>,1:2]
[@2,4:4='3',<INT>,1:4]
[@3,6:48='%  $£ K JFKL£J"!"OIJ 2432 3K3KJ£$K M£"Kdsa\n',<COMMENT>,channel=1,1:6]
[@4,49:48='<EOF>',<EOF>,2:0]
(test a = 3)

You still get a COMMENT token, it's just ignored when matching the parser rules.
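One caveat for the full MATLAB grammar: the error output in the question lists CR among the expected statement terminators, which suggests that line breaks are significant there. Under that assumption, a variant that stops before the line break and leaves it for the grammar's own newline rule may fit better. This is a sketch and has not been tested against that grammar:

```antlr
// Assumption: the host grammar has its own newline token (the question's
// error messages mention CR) that terminates statements.
// The comment stops before the line break, so that token is still emitted,
// and a comment on the last line of a file needs no trailing newline.
COMMENT: '%' ~[\r\n]* -> channel(HIDDEN);
```

In the small test grammar above, the leftover line break would simply be picked up by the WS rule and skipped.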

Now for the multiline comments:

ANTLR uses a rather "regex-like" syntax for lexer rules, but don't be fooled: it's not regex (it's actually more powerful, as it can pair up nested brackets, etc.).

From a quick reading, MATLAB multiline comments start with %{ and consume everything until %}. This is very similar to the prior rule; it just doesn't care about \r or \n, so:

MLCOMMENT: '%{' .*? '%}'    -> channel(HIDDEN);

Included in grammar:

grammar test
    ;

test: ID EQ INT;

EQ:        '=';
INT:       [0-9] ;
COMMENT:   '%' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN);
MLCOMMENT: '%{' .*? '%}'           -> channel(HIDDEN);
ID:        [a-zA-Z] ;
WS:        [ \t\r\n]  -> skip;

Input file:

a = 3 %  $£ K JFKL£J"!"OIJ 2432 3K3KJ£$K M£"Kdsa

%{
    A whole bunch of stuff
    on several
    lines
%}

➜ grun test test -tree -tokens < test.txt
[@0,0:0='a',<ID>,1:0]
[@1,2:2='=',<'='>,1:2]
[@2,4:4='3',<INT>,1:4]
[@3,6:48='%  $£ K JFKL£J"!"OIJ 2432 3K3KJ£$K M£"Kdsa\n',<COMMENT>,channel=1,1:6]
[@4,50:106='%{\n    A whole bunch of stuff\n    on several\n    lines\n%}',<MLCOMMENT>,channel=1,3:0]
[@5,108:107='<EOF>',<EOF>,8:0]
(test a = 3)
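The remark above about pairing up nested brackets can be illustrated on this same rule. If nested block comments need to be supported, a lexer rule may refer to itself. A minimal sketch, assuming each inner %{ should be closed by its own %}:

```antlr
// Recursive lexer rule: the body is either a complete nested block comment
// or any single character, matched non-greedily until the closing '%}'.
MLCOMMENT: '%{' (MLCOMMENT | .)*? '%}' -> channel(HIDDEN);
```

The non-greedy .*? (or (...)*? here) matters in both versions: a greedy loop would run on to the last %} in the file and swallow any code between two separate block comments.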