I am new to ANTLR and am having trouble getting my source code to parse correctly. This is my grammar:
compilationUnit
: (assignment | declarationList | definitionList)* EOF
;
block
: LC RC
;
assignment: typeSpecifier? IDENTIFIER '=' expression ';';
expression
: INTEGER
;
statementList
:
;
declarationList
: declaration
| declarationList declaration
;
declaration
: functionDeclaration SEMICOLON
;
functionDeclaration
: typeSpecifier? functionName functionArgs
;
definitionList
: functionDefinition
;
functionDefinition: functionDeclaration block;
functionName: IDENTIFIER;
functionArgs: LP RP;
typeSpecifier: VOID | INT;
TYPE_SPECIFIER
: VOID
| INT
;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
INTEGER: [1-9][0-9]*;
STRING_LITERAL: '"' ~('"')* '"';
VOID: 'void';
INT: 'int';
STAR: '*';
LP: '(';
RP: ')';
LC: '{';
RC: '}';
LSQRB: '[';
RSQRB: ']';
SEMICOLON: ';';
WS: [ \t\r\n] -> skip;
NEWLINE
: ( '\r' '\n'?
| '\n'
)
-> skip
;
BLOCK_COMMENT
: '/*' .*? '*/'
-> skip
;
LINE_COMMENT
: '//' ~[\r\n]*
-> skip
;
The problem is that typeSpecifier is not matched properly unless I change it to a lexer rule. If I input something like this:
void b();
int a = 1;
it returns:
line 1:0 extraneous input 'void' expecting {<EOF>, IDENTIFIER, 'void', 'int'}
line 2:0 extraneous input 'int' expecting {<EOF>, IDENTIFIER, 'void', 'int'}
But if I rename typeSpecifier to TYPE_SPECIFIER, it parses with no errors. The problem with that is that, for an assignment such as int a = 1; I cannot distinguish the type specifier from the other terminal nodes (identifiers have the same issue), so it returns:
'int' = <class 'antlr4.tree.Tree.TerminalNodeImpl'>
'a' = <class 'antlr4.tree.Tree.TerminalNodeImpl'>
'=' = <class 'antlr4.tree.Tree.TerminalNodeImpl'>
'1' = <class 'core.CParser.CParser.ExpressionContext'>
';' = <class 'antlr4.tree.Tree.TerminalNodeImpl'>
and I want it to return something more like:
'int' = <class 'antlr4.tree.Tree.TypeSpecifier'>
'a' = <class 'antlr4.tree.Tree.Identifier'>
'=' = <class 'antlr4.tree.Tree.AssignEq'> #or something like that
'1' = <class 'core.CParser.CParser.ExpressionContext'>
';' = <class 'antlr4.tree.Tree.SemiCol'>
This is my Python listener code:
from core.CParser import CParser
from core.CListener import CListener
from io import FileIO
from antlr4.tree.Tree import TerminalNodeImpl


class Listener(CListener):
    def __init__(self, output):
        self.output: FileIO = output

    def add_newline(self):
        self.output.write('\n')

    def enterDeclaration(self, ctx: CParser.DeclarationContext):
        ...

    def enterFunctionDeclaration(self,
                                 ctx: CParser.FunctionDeclarationContext):
        for child in ctx.getChildren():
            # Bare tokens (e.g. the type keyword) are written with a trailing space.
            if isinstance(child, TerminalNodeImpl):
                self.output.write(child.getText() + ' ')
            if isinstance(child, CParser.FunctionNameContext):
                self.output.write(child.getText())
            if isinstance(child, CParser.FunctionArgsContext):
                self.output.write(child.getText())
        self.output.write(';')
        self.add_newline()

    def enterAssignment(self, ctx: CParser.AssignmentContext):
        for child in ctx.getChildren():
            if isinstance(child, TerminalNodeImpl):
                self.output.write(child.getText() + ' ')
            if isinstance(child, CParser.ExpressionContext):
                self.output.write(child.getText())
        self.add_newline()

    def enterBlock(self, ctx: CParser.BlockContext):
        print(ctx.getText())
Thank you in advance :)
CodePudding user response:
You have both of the following:
typeSpecifier: VOID | INT;
TYPE_SPECIFIER
: VOID
| INT
;
Yet you never use the TYPE_SPECIFIER token in any of your parser rules. (And because the TYPE_SPECIFIER lexer rule is defined first, it will be the token type assigned; you'll never see a VOID or INT token with this rule in place. You're effectively making them fragment rules.)
Delete the TYPE_SPECIFIER lexer rule (you had the right idea to begin with).
However, you also need to move your IDENTIFIER rule below all of the keyword rules in your lexer. In a "tie" between lexer rules, ANTLR uses the rule that is defined first, so you will never see the keyword tokens as long as the more generic IDENTIFIER rule, which also matches your keywords, comes before them.
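To make the tie rule concrete, here is a small Python sketch that imitates how the lexer picks a token type: the longest match wins, and on a tie the rule listed first wins. It is only a simulation built on `re`, not ANTLR itself; the `first_token` helper and the rule lists are made up for illustration, with the patterns copied from the grammar in the question.

```python
import re


def first_token(text, rules):
    """Return the token type chosen for the start of `text`.

    Longest match wins; on a tie, the rule listed first wins
    (a simplified imitation of ANTLR's lexer behaviour).
    """
    best_name, best_len = None, 0
    for name, pattern in rules:
        match = re.match(pattern, text)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if match and len(match.group()) > best_len:
            best_name, best_len = name, len(match.group())
    return best_name


IDENTIFIER = ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*")
VOID = ("VOID", r"void")
INT = ("INT", r"int")
TYPE_SPECIFIER = ("TYPE_SPECIFIER", r"void|int")

# Rule order as in the question's grammar.
original = [TYPE_SPECIFIER, IDENTIFIER, VOID, INT]
# TYPE_SPECIFIER deleted, IDENTIFIER still above the keywords.
without_type_specifier = [IDENTIFIER, VOID, INT]
# IDENTIFIER moved below the keywords.
keywords_first = [VOID, INT, IDENTIFIER]

print(first_token("void b();", original))                # TYPE_SPECIFIER
print(first_token("void b();", without_type_specifier))  # IDENTIFIER
print(first_token("void b();", keywords_first))          # VOID
print(first_token("integer", keywords_first))            # IDENTIFIER
```

The last call shows that putting the keywords first does not break ordinary identifiers that merely start with a keyword, because the longer match still wins.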
---
Also, it's a really good idea to have a start rule that ends with EOF, as your compilationUnit rule does, to ensure all input is parsed.
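On telling the tree nodes apart: once typeSpecifier is a parser rule again, that child should arrive as its own context class, and the remaining terminal nodes can be told apart by token type rather than by Python class. The sketch below assumes the Python runtime's terminal nodes expose the token as `symbol` with an integer `type`, and that the generated parser has a `symbolicNames` list. The `describe` helper and the stand-in objects are hypothetical, so the example runs without a generated parser.

```python
from types import SimpleNamespace


def describe(child, symbolic_names):
    """Label a parse-tree child: token name for terminals, class name otherwise."""
    symbol = getattr(child, "symbol", None)  # assumed: terminal nodes carry their token here
    if symbol is None:
        return type(child).__name__
    if 0 <= symbol.type < len(symbolic_names):
        return symbolic_names[symbol.type]
    return "<INVALID>"


# Stand-ins for illustration only; real values would come from the generated parser.
SYMBOLIC_NAMES = ["<INVALID>", "IDENTIFIER", "INTEGER"]


class ExpressionContext:
    pass


identifier_node = SimpleNamespace(symbol=SimpleNamespace(type=1))
print(describe(identifier_node, SYMBOLIC_NAMES))      # IDENTIFIER
print(describe(ExpressionContext(), SYMBOLIC_NAMES))  # ExpressionContext
```

In the listener this would be called as something like `describe(child, CParser.symbolicNames)`. Tokens that only appear as literals in parser rules, such as '=', have no symbolic name, so giving them a named lexer rule is one way to get a readable label for them.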