I parse data using ply. I am trying to use a space as part of a lexeme. Here is a simplified example:
from ply.lex import lex
from ply.yacc import yacc

tokens = ('NUM', 'SPACE')

t_NUM = r'\d+'
t_SPACE = r' '

def t_error(t):
    print(f'Illegal character {t.value[0]!r}')
    t.lexer.skip(1)

lexer = lex()

def p_two(p):
    '''
    two : NUM SPACE NUM
    '''
    p[0] = ('two', p[1], p[2], p[3])

def p_error(p):
    if p:
        print(f"Syntax error at '{p.value}'")
    else:
        print("Syntax error at EOF")

parser = yacc()
ast = parser.parse('1 2')
print(ast)
When I run it, I get this error:
ERROR: Regular expression for rule 't_SPACE' matches empty string
Traceback (most recent call last):
  File "c:\demo\simple_space.py", line 19, in <module>
    lexer = lex()
  File "C:\demo\3rdparty\ply\ply\lex.py", line 752, in lex
    raise SyntaxError("Can't build lexer")
SyntaxError: Can't build lexer
Is it possible to specify a space as part of a lexeme? A few additional tokens I would like to define:

t_COMMENT = r' \#.*'   # for a comment
t_DIVIDE = r': '       # for a divider
CodePudding user response:
This is explained in the Ply manual section on Specification of tokens:

Internally, lex.py uses the re module to do its pattern matching. Patterns are compiled using the re.VERBOSE flag which can be used to help readability. However, be aware that unescaped whitespace is ignored and comments are allowed in this mode. If your pattern involves whitespace, make sure you use \s. If you need to match the # character, use [#].
So a literal space character must be written as [ ] or as \ (a backslash followed by a space). (\s, as suggested in the manual, matches any whitespace character, not just a space.)
CodePudding user response:
I have no idea why it doesn't work in ply, but it seems to work with sly, which was created by the same author (a few years later, so he had the experience of writing ply behind him):
from sly import Lexer, Parser

class MyLexer(Lexer):
    tokens = { NUM, SPACE }

    NUM = r'\d+'
    SPACE = r' '

    def error(self, t):
        print(f'Illegal character {t.value[0]!r}')
        self.index += 1

class MyParser(Parser):
    tokens = MyLexer.tokens

    @_('NUM SPACE NUM')
    def two(self, p):
        return ('two', p.NUM0, p.SPACE, p.NUM1)

lexer = MyLexer()
parser = MyParser()
ast = parser.parse(lexer.tokenize('1 2'))
print(ast)
EDIT:

What caught my eye in the message was the text 't_SPACE' matches empty string. It suggested to me that the space may have a special meaning, so I tested "\ " - and it works:

t_SPACE = r'\ '
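As a cross-check (using only the standard re module, which ply uses internally), each of the escaped spellings stays a literal space under re.VERBOSE, while the bare space collapses to an empty pattern:

```python
import re

# These spellings survive re.VERBOSE and match one literal space:
for pattern in (r'\ ', r'[ ]'):
    assert re.compile(pattern, re.VERBOSE).fullmatch(' ') is not None

# A bare space is ignored by re.VERBOSE, leaving an empty pattern,
# which is why ply rejects t_SPACE = r' '.
assert re.compile(r' ', re.VERBOSE).fullmatch('') is not None
print('all checks passed')
```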