I'm making a function that will take a string like "three()" or something like "{1 2}" and put them into a list of token (EX: "three()" = ["three", "(", ")"] I using the re.match to help separate the string.
def lex(s):
# scan input string and return a list of its tokens
seq = []
patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/|)(\t|\n|\r| )*")
m = re.match(patterns,s)
while m != None:
if s == '':
break
seq.append(m.group(2))
s = s[len(m.group(0)):]
m = re.match(patterns,s)
return seq
This one works if the string is just "three". But if the string contains "()" or any symbol it stays in the while loop. But a funny thing happens when move ([a-z])* in the pattern string it works. Why is that happening?
works: patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
Does not work: patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
CodePudding user response:
This one is a bit tricky, but the problem is with this part ([a-z])*
. This matches any string of lowercase letters size 0 (zero) or more.
If you put this sequence at the end, like here:
patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
The regex engine will try the other matches first, and if it finds a match, stop there. Only if none of the others match, does it try ([a-z])*
and since *
is 'greedy', it will match all of three
, then proceed to match (
and finally )
.
Read an explanation of how the full expression is tested in the documentation (thanks to @kaya3).
However, if you put that sequence a the start, like here:
patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
It will now try to match it first. It's still greedy, so three
still gets matched. But then on the next try, it will try to match ([a-z])*
to the remaining '()'
- and it matches, since that string starts with zero letters.
It keeps matching it like that, and gets stuck in the loop. You can fix it by changing the *
for a
which will only match if there is 1 or more matches:
patterns = (r"^(\t|\n|\r| )*(([a-z]) |[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")