Python Regex Nuance with Non-Capturing Group Containing $

Time:09-26

I'm trying to extract these two dates from this string:

re.findall(r"(?:^|[^0-9])(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})(?:$| )", 'For the period from 7/1/22 through 8/31/22', flags=18)
[('7', '1', '22'), ('8', '31', '22')]

Why don't I get that result when I use this regex (with a slightly modified ending)?

re.findall(r"(?:^|[^0-9])(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})(?:$|[^0-9]+)", 'For the period from 7/1/22 through 8/31/22', flags=18)
[('7', '1', '22')]

I know I'm missing something silly here; thanks in advance for your help!

CodePudding user response:

The second regex doesn't match only the parts "7/1/22" and "8/31/22". Its first match is " 7/1/22 through " (from the leading space up to and including the space before the 8), and the second date could only match as " 8/31/22", which needs that same space. The two matches would overlap in that one space character, and re.findall returns only non-overlapping matches, so the second date is never found.

To fix that, I would propose using lookahead and lookbehind. You can change the beginning of the regex from (?:^|[^0-9]) to (?:^|(?<=[^0-9])) and the end from (?:$|[^0-9]+) to (?=$|[^0-9]+). This way the matches won't contain additional characters and they won't overlap anymore.

More about lookahead and lookbehind: Regex lookahead, lookbehind and atomic groups
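
To see the overlap concretely, here is a small sketch (not the asker's exact session) that shows what each pattern actually consumes. It assumes the second pattern ends in (?:$|[^0-9]+). The names CORE, consuming, and lookaround are made up for illustration, and the flags argument is left out because neither pattern contains letters or a dot.

```python
import re

text = "For the period from 7/1/22 through 8/31/22"

# Shared month/day/year part, copied from the question's patterns.
CORE = r"(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})"

# Assumed second pattern: the boundary groups consume the characters they match.
consuming = re.compile(r"(?:^|[^0-9])" + CORE + r"(?:$|[^0-9]+)")

# Lookaround variant from the answer: boundaries are checked but not consumed.
lookaround = re.compile(r"(?:^|(?<=[^0-9]))" + CORE + r"(?=$|[^0-9]+)")

# group(0) is the full matched text, including any consumed boundary characters.
consumed = [m.group(0) for m in consuming.finditer(text)]
dates = lookaround.findall(text)

print(consumed)  # the single match swallows the space in front of the 8
print(dates)     # both dates are found once nothing extra is consumed
```

The same two boundaries can also be written more compactly as (?<![0-9]) and (?![0-9]).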

CodePudding user response:

Regexes are difficult to read, debug, and maintain. Whenever possible, I'd rather trade performance for readability.

In this case that would mean using a much simpler regex and writing an additional function that takes over the burden of checking whether the date is valid.

import re

pattern = re.compile(r'(\d{1,2})/(\d{1,2})/(\d{1,4})')

def is_valid_date(m: re.Match) -> bool:
    # Range checks only: month 1-12, day 1-31, year > 0.
    # Per-month day limits (e.g. 2/31) are not checked.
    month = int(m.group(1))
    day = int(m.group(2))
    year = int(m.group(3))
    return 1 <= month <= 12 and 1 <= day <= 31 and year > 0

It gives you:

s = "For the period from 7/1/22 through 8/31/22, and 13/2/24 is not a valid date"

dates = [m for m in pattern.finditer(s) if is_valid_date(m)]

>>> [<re.Match object; span=(20, 26), match='7/1/22'>, <re.Match object; span=(35, 42), match='8/31/22'>]

If you want to output a list of tuples:

dates_tp = [(m.group(1), m.group(2), m.group(3)) for m in dates]
>>> [('7', '1', '22'), ('8', '31', '22')]
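
The range checks above still accept impossible dates such as 2/31/22. If that matters, one possible stricter variant is to let datetime.date do the validation. This is only a sketch: the helper name is_real_date is hypothetical, and mapping two-digit years to the 2000s is an assumption that is not part of the original answer.

```python
import re
from datetime import date

pattern = re.compile(r'(\d{1,2})/(\d{1,2})/(\d{1,4})')

def is_real_date(m: re.Match) -> bool:
    """Hypothetical stricter check: datetime.date rejects impossible dates."""
    month, day, year = (int(g) for g in m.groups())
    if year < 100:
        year += 2000  # assumption: two-digit years belong to the 2000s
    try:
        date(year, month, day)  # raises ValueError for e.g. month 13 or Feb 31
    except ValueError:
        return False
    return True

s = "From 7/1/22 through 8/31/22; 2/31/22 and 13/2/24 are not real dates"
real_dates = [m.group(0) for m in pattern.finditer(s) if is_real_date(m)]
print(real_dates)
```

The regex stays as simple as before; only the validation function changes.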

Edit:

It looks like this method is also slightly faster (times in seconds for 1,000,000 calls):

import re
import timeit

# Assumption: the benchmark reuses the sample string from above.
s = "For the period from 7/1/22 through 8/31/22, and 13/2/24 is not a valid date"

pat_1 = r"(?:^|[^0-9])(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})(?:$| )"

def method_1(s):
    # flags=18 is re.IGNORECASE | re.DOTALL
    res = re.findall(pat_1, s, flags=18)
    return res

pat_2 = re.compile(r'(\d{1,2})/(\d{1,2})/(\d{1,4})')

def is_valid_date(m: re.Match):
    month = int(m.group(1))
    day = int(m.group(2))
    year = int(m.group(3))
    b1 = day > 0 and day <= 31
    b2 = month > 0 and month <= 12
    b3 = year > 0
    return b1 and b2 and b3

def method_2(s):
    dates_tp = [(m.group(1), m.group(2), m.group(3)) for m in pat_2.finditer(s) if is_valid_date(m)]
    return dates_tp

timeit.timeit("method_1(s)", globals=globals(), number=1000000)
>>> 5.62

timeit.timeit("method_2(s)", globals=globals(), number=1000000)
>>> 4.85