ANTLR4 Assembler Language Parser - issues - miscellaneous comments

I am trying to write a parser for IBM Assembler Language; an example is below.

Comment lines start with a star (*) as the first character. However, there are two problems:

Beyond a set point in the line there can also be descriptive text, and no star is necessary there.

The descriptive text can, and does, contain lexer tokens such as ENTRY or INPUT.

*        TYPE.                                                          
ARG      DSECT                                                           
NXENT    DS    F                    some comment text ENTRY NUMBER             
NMADR    DS    F                    some comment text INPUT NAME           
NAADR    DS    F                    some comment text                    
NATYP    DS    F                    some comment text           
NAENT    DS    F                    some comment text          
         ORG   NATYP                some comment text

In my lexer I have devised the following, which works absolutely fine:

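// A '*' in column 1 starts a full-line comment. The predicate runs after the
// '*' has been consumed, so the char position in the line is 1 at that point.
fragment CommentLine: Star {getCharPositionInLine() == 1}? .*? Nl ;
fragment Star: '*' ;
fragment Nl: '\r'? '\n' ;
COMMENT_LINE
   : CommentLine -> channel(COMMENT)
   ;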

My question is: how do I handle the trailing comments, which start at a particular character position in the line, in the parser grammar? I.e. Parser -> NAME DS INT? LETTER ??????????

CodePudding user response：

Sending comments to a COMMENT channel (or discarding them with -> skip) is a technique for avoiding having to define, in your parser rules, every place where comments are valid.

(Old 360 Assembler programmer here)

Assembler source does not really allow arbitrarily positioned comments, so you don't need to deal with shunting them off to the side. In fact, because of the way trailing comments are handled in Assembler source, there's just NOT a way to identify them in a lexer rule.

Since they can be handled by a parser rule instead, you could set up a rule like:

trailingComment: (ID | STRING | NUMBER)* EOL;

where ID, STRING, NUMBER, etc. are just the tokens in your lexer. (You'd need to include pretty much all of them, which is a good argument for not getting down to individual tokens for MVC, CLC, CLI and all the other op codes; that is the path to madness.) And of course EOL is your rule to match the end of a line (probably '\r'? '\n').

You would then end each of your rules for parsing a line that can contain a trailing comment (pretty much all of them) with the trailingComment rule.
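
To make that concrete, here is a rough, untested sketch of how the trailingComment rule might be wired into line-level rules for the sample in the question. The grammar name AsmSketch, the rules program, line, statement and operand, the catch-all ANY token, and the exact token definitions are all assumptions for illustration; only trailingComment, ID, STRING, NUMBER, EOL and the keywords from the sample come from the discussion above. It sends full-line comments to the built-in HIDDEN channel rather than a custom COMMENT channel, simply to keep the sketch short.

```antlr
grammar AsmSketch;   // hypothetical grammar name

// ---- Parser rules (hypothetical names, apart from trailingComment) ----
program         : line* EOF ;

// Every line ends with trailingComment, which also consumes the EOL.
// A blank line or a full-line comment is just an empty trailingComment.
line            : statement? trailingComment ;

statement       : ID? DS operand        // e.g. NXENT DS F
                | ID DSECT              // e.g. ARG DSECT
                | ORG operand           // e.g. ORG NATYP
                ;

operand         : ID | NUMBER ;

// Remarks: any run of tokens up to the end of the line. Keyword tokens such
// as ENTRY and INPUT must be listed too, or they cannot appear in remarks.
trailingComment : (ID | NUMBER | STRING | DS | DSECT | ORG | ENTRY | INPUT | ANY)* EOL ;

// ---- Lexer rules (assumed definitions) ----
// Full-line comment: '*' in column 1, leaving the newline for EOL.
COMMENT_LINE : '*' {getCharPositionInLine() == 1}? ~[\r\n]* -> channel(HIDDEN) ;

DS      : 'DS' ;
DSECT   : 'DSECT' ;
ORG     : 'ORG' ;
ENTRY   : 'ENTRY' ;
INPUT   : 'INPUT' ;

ID      : [A-Za-z@#$] [A-Za-z0-9@#$_]* ;
NUMBER  : [0-9]+ ;
STRING  : '\'' (~['\r\n])* '\'' ;
EOL     : '\r'? '\n' ;
WS      : [ \t]+ -> skip ;
ANY     : . ;        // catch-all so punctuation in remarks still lexes
```

This assumes every source line ends with a newline. Because whitespace is skipped, the parser has no column information, so everything after the operand is simply treated as remarks; instructions with more complex operands would need a fuller operand rule before that holds up.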

CodePudding user response：

I was heading down the path described by Mike Lischke before making my post, and I had previously stumbled across the technique you mentioned as well, but thought I would canvass for other opinions. If I am honest, I prefer your approach, as I would then get the advantage of the grammar's visitor and listener events. I want the grammar to have some level of intelligence, but MikeL's technique of chunking up the line puts the intelligence up into the application, unless I am misunderstanding?