Home > OS >  Why does the order of expressions matter in re.match?
Why does the order of expressions matter in re.match?

Time:12-01

I'm making a function that will take a string like "three()" or something like "{1 2}" and put them into a list of token (EX: "three()" = ["three", "(", ")"] I using the re.match to help separate the string.

def lex(s):
# scan input string and return a list of its tokens
seq = []
patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/|)(\t|\n|\r| )*")
m = re.match(patterns,s)
while m != None:
    if s == '':
        break
    seq.append(m.group(2))
    s = s[len(m.group(0)):]
    m = re.match(patterns,s)
return seq

This one works if the string is just "three". But if the string contains "()" or any symbol it stays in the while loop. But a funny thing happens when move ([a-z])* in the pattern string it works. Why is that happening?

works: patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
Does not work: patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")

CodePudding user response:

This one is a bit tricky, but the problem is with this part ([a-z])*. This matches any string of lowercase letters size 0 (zero) or more.

If you put this sequence at the end, like here:

patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")

The regex engine will try the other matches first, and if it finds a match, stop there. Only if none of the others match, does it try ([a-z])* and since * is 'greedy', it will match all of three, then proceed to match ( and finally ).

Read an explanation of how the full expression is tested in the documentation (thanks to @kaya3).

However, if you put that sequence a the start, like here:

patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")

It will now try to match it first. It's still greedy, so three still gets matched. But then on the next try, it will try to match ([a-z])* to the remaining '()' - and it matches, since that string starts with zero letters.

It keeps matching it like that, and gets stuck in the loop. You can fix it by changing the * for a which will only match if there is 1 or more matches:

patterns = (r"^(\t|\n|\r| )*(([a-z]) |[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
  • Related