I parse data using ply. I am trying to use a space as part of a lexeme. Here is a simplified example:
from ply.lex import lex
from ply.yacc import yacc

tokens = ('NUM', 'SPACE')

t_NUM = r'\d+'
t_SPACE = r' '

def t_error(t):
    print(f'Illegal character {t.value[0]!r}')
    t.lexer.skip(1)

lexer = lex()

def p_two(p):
    '''
    two : NUM SPACE NUM
    '''
    p[0] = ('two', p[1], p[2], p[3])

def p_error(p):
    if p:
        print(f"Syntax error at '{p.value}'")
    else:
        print("Syntax error at EOF")

parser = yacc()
ast = parser.parse('1 2')
print(ast)
When I run it, I get this error:
ERROR: Regular expression for rule 't_SPACE' matches empty string
Traceback (most recent call last):
  File "c:\demo\simple_space.py", line 19, in <module>
    lexer = lex()
  File "C:\demo\3rdparty\ply\ply\lex.py", line 752, in lex
    raise SyntaxError("Can't build lexer")
SyntaxError: Can't build lexer
Is it possible to specify a space as part of a lexeme? A few additional tokens I would like to define:

t_COMMENT = r' \#.*'   # for a comment
t_DIVIDE = r': '       # for a divider
CodePudding user response:
This is explained in the Ply manual section on Specification of tokens:

Internally, lex.py uses the re module to do its pattern matching. Patterns are compiled using the re.VERBOSE flag which can be used to help readability. However, be aware that unescaped whitespace is ignored and comments are allowed in this mode. If your pattern involves whitespace, make sure you use \s. If you need to match the # character, use [#].
So a literal space character must be written as [ ] or as \ (a backslash followed by a space). (\s, as suggested in the manual, matches any whitespace character, not just a space.)
CodePudding user response:
I have no idea why it doesn't work in ply, but it seems to work with sly, which was created by the same author (a few years later, so he had the experience of writing ply behind him):
from sly import Lexer, Parser

class MyLexer(Lexer):
    tokens = { NUM, SPACE }

    NUM = r'\d+'
    SPACE = r' '

    def error(self, t):
        print(f'Illegal character {t.value[0]!r}')
        self.index += 1

class MyParser(Parser):
    tokens = MyLexer.tokens

    @_('NUM SPACE NUM')
    def two(self, p):
        return ('two', p.NUM0, p.SPACE, p.NUM1)

lexer = MyLexer()
parser = MyParser()
ast = parser.parse(lexer.tokenize('1 2'))
print(ast)
EDIT:

What caught my eye in the message was the text 't_SPACE' matches empty string. It suggested to me that the space may have a special meaning, so I tested "\ " - and it works:

t_SPACE = r'\ '
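As a cross-check (using only the standard re module, which ply uses internally), each of the escaped spellings stays a literal space under re.VERBOSE, while the bare space collapses to an empty pattern:

```python
import re

# These spellings survive re.VERBOSE and match one literal space:
for pattern in (r'\ ', r'[ ]'):
    assert re.compile(pattern, re.VERBOSE).fullmatch(' ') is not None

# A bare space is ignored by re.VERBOSE, leaving an empty pattern,
# which is why ply rejects t_SPACE = r' '.
assert re.compile(r' ', re.VERBOSE).fullmatch('') is not None
print('all checks passed')
```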