I'm trying to extract these two dates from this string:

re.findall(r"(?:^|[^0-9])(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})(?:$| )", 'For the period from 7/1/22 through 8/31/22', flags=18)
[('7', '1', '22'), ('8', '31', '22')]

Why don't I get the same result with this regex, which has a slightly modified ending?

re.findall(r"(?:^|[^0-9])(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})(?:$|[^0-9] )", 'For the period from 7/1/22 through 8/31/22', flags=18)
[('7', '1', '22')]

I know I'm missing something silly here; thanks in advance for your help!

CodePudding user response：

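The second regex does not match just 7/1/22 and 8/31/22. The groups at either end consume the characters around each date, so the candidate matches are " 7/1/22 through " (up to and including the space before the 8) and " 8/31/22" (starting at that same space). The two candidates overlap on that one space character, and re.findall returns only non-overlapping matches, so the second date is never found.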
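
To see which characters each match consumes, it can help to print the full match (group 0) instead of only the captured groups. The sketch below assumes the modified ending carries a quantifier, i.e. [^0-9]* followed by a space; as literally written, [^0-9] followed by a space cannot match the " t" that follows 7/1/22, so the span described above would not arise. The helper name consumed is made up for illustration.

```python
import re

s = "For the period from 7/1/22 through 8/31/22"
date = r"(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})"
flags = re.IGNORECASE | re.DOTALL  # same value as flags=18 in the question

first = r"(?:^|[^0-9])" + date + r"(?:$| )"
# Assumption: the second ending has a quantifier, [^0-9]* rather than [^0-9].
second = r"(?:^|[^0-9])" + date + r"(?:$|[^0-9]* )"

def consumed(pattern):
    """Return the full text (group 0) of every non-overlapping match."""
    return [m.group(0) for m in re.finditer(pattern, s, flags)]

print(consumed(first))   # [' 7/1/22 ', ' 8/31/22']
print(consumed(second))  # [' 7/1/22 through ']
```

Under that assumption, the second pattern's only match ends after the space in front of 8/31/22, so the leading (?:^|[^0-9]) has no character left to match there.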

To fix that, I suggest using a lookbehind and a lookahead. Change the beginning of the regex from (?:^|[^0-9]) to (?:^|(?<=[^0-9])) and the end from (?:$|[^0-9] ) to (?=$|[^0-9] ). Lookarounds check the surrounding text without consuming it, so the matches no longer contain the extra characters and no longer overlap.
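
A minimal sketch of the lookaround version, assuming the ending is meant to read [^0-9]* followed by a space (the quantifier is an assumption; without it the lookahead would reject 7/1/22, which is followed by " t"). The name fixed is illustrative only.

```python
import re

s = "For the period from 7/1/22 through 8/31/22"

fixed = re.compile(
    # Start of string, or preceded by a non-digit (checked, not consumed).
    r"(?:^|(?<=[^0-9]))"
    r"(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})"
    # End of string, or non-digits then a space (assumed quantifier on [^0-9]).
    r"(?=$|[^0-9]* )",
    flags=re.IGNORECASE | re.DOTALL,  # equivalent to flags=18
)

print(fixed.findall(s))                         # [('7', '1', '22'), ('8', '31', '22')]
print([m.group(0) for m in fixed.finditer(s)])  # ['7/1/22', '8/31/22']
```

Because both assertions are zero-width, group 0 is just the date, and the space before 8/31/22 stays available for the second match.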

More about lookahead and lookbehind: Regex lookahead, lookbehind and atomic groups

CodePudding user response：

Regexes are difficult to read, debug and maintain. Whenever possible, I'd rather trade some performance for readability.

In this case that means using a much simpler regex to find candidates and moving the check for whether a date is valid into a separate function.

import re

pattern = re.compile(r'(\d{1,2})/(\d{1,2})/(\d{1,4})')

def is_valid_date(m: re.Match) -> bool:
    """Range-check the captured month, day and year (not a full calendar check)."""
    month = int(m.group(1))
    day = int(m.group(2))
    year = int(m.group(3))
    return 1 <= month <= 12 and 1 <= day <= 31 and year > 0

It gives you:

s = "For the period from 7/1/22 through 8/31/22, and 13/2/24 is not a valid date"

dates = [m for m in pattern.finditer(s) if is_valid_date(m)]
print(dates)
# [<re.Match object; span=(20, 26), match='7/1/22'>, <re.Match object; span=(35, 42), match='8/31/22'>]

If you want to output a list of tuples:

dates_tp = [m.groups() for m in dates]
print(dates_tp)
# [('7', '1', '22'), ('8', '31', '22')]

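The range check in is_valid_date accepts impossible dates such as 2/31/22. If real calendar validity matters, one option is to let datetime.strptime do the validation. This is only a sketch: it assumes two-digit years and the month/day/year layout, and CANDIDATE and extract_dates are illustrative names, not part of the answer above.

```python
import re
from datetime import date, datetime

# Loose candidate pattern; the lookarounds keep it from matching inside longer digit runs.
CANDIDATE = re.compile(r"(?<!\d)\d{1,2}/\d{1,2}/\d{2}(?!\d)")

def extract_dates(text):
    """Return a list of datetime.date objects for every candidate that parses."""
    found = []
    for m in CANDIDATE.finditer(text):
        try:
            # %y maps two-digit years to a full year (22 -> 2022).
            found.append(datetime.strptime(m.group(0), "%m/%d/%y").date())
        except ValueError:
            # Not a real date (e.g. month 13 or February 31): skip it.
            continue
    return found

sample = "From 7/1/22 through 8/31/22, but not 13/2/24 or 2/31/22"
print(extract_dates(sample))
# [datetime.date(2022, 7, 1), datetime.date(2022, 8, 31)]
```
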
Edit:

In a quick timeit comparison, this approach also turned out to be slightly faster:

pat_1 = r"(?:^|[^0-9])(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})(?:$| )"

def method_1(s):
    # flags=18 is re.IGNORECASE | re.DOTALL
    return re.findall(pat_1, s, flags=18)

pat_2 = re.compile(r'(\d{1,2})/(\d{1,2})/(\d{1,4})')

def is_valid_date(m: re.Match) -> bool:
    month = int(m.group(1))
    day = int(m.group(2))
    year = int(m.group(3))
    return 1 <= month <= 12 and 1 <= day <= 31 and year > 0

def method_2(s):
    return [m.groups() for m in pat_2.finditer(s) if is_valid_date(m)]

import timeit

timeit.timeit("method_1(s)", globals=globals(), number=1000000)
# 5.62 (total seconds for 1,000,000 calls)

timeit.timeit("method_2(s)", globals=globals(), number=1000000)
# 4.85 (total seconds for 1,000,000 calls)