I am writing a parser (using PLY) for a subset of python. It includes: int
, name
, add
, unarysub
, assignments
, and function_calls
.
So far I have the following grammar:
# Grammar for P0:
# statements := statement
# statement := simple_statement
# simple_statement := assignment | d_expression
# assignment := NAME ['=' expression]
# d_expression := expression
# expression := sum
# sum := sum ' ' term | term
# term := factor
# factor := ' ' factor | '-' factor | primary
# primary: primary '(' [arguments] ') | atom
# atom := INT | NAME
# arguments := args | empty
# args := arg | arg ',' args
# arg := expression
It works as intended for pretty much all the cases I need. However, I am having trouble with discard expressions(d_expression
) above. Python supports discard expressions. For eg, python treats both x = 1 2
and 1 2
as valid program. The former gets an Assign node in python AST, while the latter gets an Expr node(which is discarded, hence the name discard expression).
I am trying to emulate this behavior in my parser, but when implementing this feature, my parser treats 1 - 2
as a list of two Expr nodes(Expr(Constant(value=1)
, Expr(UnaryOp(op=USub(), operand=Constant(value=2)
). This is an incorrect behavior and it should throw a syntax error as I don't have a expression MINUS expression
in my grammar. I am not sure where I am going wrong, any help is appreciated.
Below is the code implementation of my parser:
class MyParser:
tokens = MyLexer.tokens
precedence = (
('left', 'PLUS', 'MINUS'),
('right', 'UMINUS'),
)
def __init__(self):
self.lexer = P0Lexer()
self.parser = yacc.yacc(module=self)
def p_module(self, p):
'''
module : statements
'''
p[0] = Module(body=p[1])
logging.info(ast.dump(p[0]))
def p_statements(self, p):
'''
statements : statement
| statement statements
'''
if len(p) == 2:
p[0] = [p[1]]
else:
p[0] = [p[1]] p[2]
for i in p[0]:
logging.info(ast.dump(i))
def p_statement(self, p):
'''
statement : simple_statement
'''
p[0] = p[1]
logging.info(ast.dump(p[0]))
def p_simple_statement(self, p):
'''
simple_statement : assignment
'''
p[0] = p[1]
logging.info(ast.dump(p[0]))
def p_assignment(self, p):
'''
assignment : NAME EQUALS expression
'''
p[0] = Assign(targets=[Name(id=p[1], ctx=Store())], value=p[3])
logging.info(ast.dump(p[0]))
# def p_d_expression(self, p):
# '''
# d_expression : expression
# '''
# p[0] = Expr(value=p[1])
# logging.info(ast.dump(p[0]))
def p_expression(self, p):
'''
expression : sum
'''
p[0] = p[1]
logging.info(ast.dump(p[0]))
def p_sum(self, p):
'''
sum : sum PLUS term
| term
'''
if len(p) == 2:
p[0] = p[1]
else:
p[0] = BinOp(left=p[1], op=Add(), right=p[3])
logging.info(ast.dump(p[0]))
def p_term(self, p):
'''
term : factor
'''
p[0] = p[1]
logging.info(ast.dump(p[0]))
def p_factor(self, p):
'''
factor : MINUS factor %prec UMINUS
| primary
'''
if len(p) == 2:
p[0] = p[1]
else:
if p[1] == ' ':
p[0] = UnaryOp(op=UAdd(), operand=p[2])
else:
p[0] = UnaryOp(op=USub(), operand=p[2])
logging.info(ast.dump(p[0]))
def p_primary(self, p):
'''
primary : primary LPAREN arguments RPAREN
| atom
'''
if len(p) == 2:
p[0] = p[1]
else:
p[0] = Call(func=p[1], args=p[3])
logging.info(ast.dump(p[0]))
def p_arguments(self, p):
'''
arguments : args
| empty
'''
if len(p) == 2:
p[0] = p[1] if p[1] is not None else []
else:
p[0] = []
for i in p[0]:
logging.info(ast.dump(i))
def p_args(self, p):
'''
args : arg
| arg COMMA args
'''
if len(p) == 2:
p[0] = [p[1]]
else:
p[0] = [p[1]] p[3]
for i in p[0]:
logging.info(ast.dump(i))
def p_arg(self, p):
'''
arg : expression
'''
p[0] = p[1]
logging.info(ast.dump(p[0]))
def p_atom(self, p):
'''
atom : INT
| NAME
'''
if isinstance(p[1], int):
p[0] = Constant(value=p[1])
else:
p[0] = Name(id=p[1], ctx=Load())
logging.info(ast.dump(p[0]))
def p_empty(self, p):
'''
empty :
'''
pass
def p_error(self, p):
if p:
logging.error("Syntax error at '%s'" % p.value)
else:
logging.error("Syntax error at EOF")
exit(1)
CodePudding user response:
I suspect the issue here is that
statements := statement
is your culprit. This allows you to place any two statement
s side-by-side and have it be considered a statements
. If that's the case, then 1 - 2
is indeed a valid statements
, because, as you noted, this can be broken apart into the statements 1
and -2
, each of which is a discard-expression.
It might be helpful to look at other programming languages as an example to see how to resolve this. In languages like C, C , or Java, expressions can be made into statements because there's a grammar rule to the effect of
statement := expression ;
with the ;
serving as a separator. That means that if you were to find something like
1 2 2 3;
in a C/C /Java program, it would be a syntax error because while 1 2
and 2 3
are expressions, those expressions aren't statements without a semicolon after them. On the other hand, it would be legal in those languages to write
1 2; 2 3;
on one line, and there's a clear delimiter between the two expressions making up the two statements.
Now let's look at Python. Python will not allow you to write
1 2 2 3
and have it count as two expressions. Why not? Well, if you check out the Python grammar, you can see that in the way it parses a series of simple statements into a simple_stmts
expression. (Python uses a parsing expression grammar rather than a CFG, so the syntax is a bit different.)
simple_stmts:
| simple_stmt !';' NEWLINE
| ';'.simple_stmt [';'] NEWLINE
In other words, a simple_stmts
is either
- A
simple_stmt
followed by a newline, or - A series of
simple_stmt
s, separated by semicolons, and followed by a newline.
Each of these rules gives clear delimiters between expressions, either newlines or semicolons (or both).
So in this sense, I suspect the issue is fundamentally not one about drop expressions but rather about how you allow multiple statements to be separated from one another. Requiring some sort of delimiter - say, a newline or semicolon - after each expression used as a statement should fix this.
CodePudding user response:
Thanks to @templatetypdef. I was allowing two expression without a separator.
I solved it by introducing this production:
simple_stmts:
| simple_stmt !';' NEWLINE # Not needed, there for speedup
| ';'.simple_stmt [';'] NEWLINE