I'm trying to extract these two dates from this string:
re.findall(r"(?:^|[^0-9])(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})(?:$| )", 'For the period from 7/1/22 through 8/31/22', flags=18)
[('7', '1', '22'), ('8', '31', '22')]
Why don't I get that result when I use this regex (with a slightly modified ending)?
re.findall(r"(?:^|[^0-9])(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})(?:$|[^0-9] )", 'For the period from 7/1/22 through 8/31/22', flags=18)
[('7', '1', '22')]
I know I'm missing something silly here; thanks in advance for your help!
CodePudding user response:
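The second regex doesn't match only the parts 7/1/22 and 8/31/22. To find both dates it would have to match "7/1/22 through " (including the trailing space) and " 8/31/22" (including the leading space).
Those two candidates overlap in the one space character before the 8, and re.findall returns only non-overlapping matches, so the second date is dropped.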
To fix that, I would suggest using lookahead and lookbehind.
You can change the beginning of the regex from (?:^|[^0-9]) to (?:^|(?<=[^0-9])) and the end from (?:$|[^0-9] ) to (?=$|[^0-9] ).
This way the matches won't contain additional characters and they won't overlap anymore.
More about lookahead and lookbehind: Regex lookahead, lookbehind and atomic groups
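Here is a small sketch of the difference between consuming the boundary characters and only looking at them. It assumes the modified ending allows a run of non-digits before the space, written here as [^0-9]* followed by a space. That quantifier is my assumption; it is the form that reproduces the single match reported in the question. The variable names are only for illustration.

```python
import re

s = 'For the period from 7/1/22 through 8/31/22'

# The date part shared by both patterns (taken from the question).
date = r"(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})"

# Assumed consuming ending: a run of non-digits followed by a space.
consuming = r"(?:^|[^0-9])" + date + r"(?:$|[^0-9]* )"

# Same boundaries, but expressed as zero-width lookbehind/lookahead.
lookaround = r"(?:^|(?<=[^0-9]))" + date + r"(?=$|[^0-9]* )"

# group(0) shows everything the consuming pattern swallows,
# including the space that the second date needs in front of it.
consumed = [m.group(0) for m in re.finditer(consuming, s)]
print(consumed)   # [' 7/1/22 through ']

# With lookarounds nothing outside the date is consumed, so both dates match.
fixed = re.findall(lookaround, s)
print(fixed)      # [('7', '1', '22'), ('8', '31', '22')]
```

The first match of the consuming pattern ends right before the 8, so the engine resumes there with no non-digit left to satisfy the leading boundary.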
CodePudding user response:
Regexes are difficult to read, debug, and maintain. Whenever possible, I'd rather trade performance for readability.
In this case that would mean using a much simpler regex and writing an additional function that takes over the burden of checking whether the date is valid.
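import re

# Loose pattern: three digit groups separated by "/"; validity is checked in Python.
pattern = re.compile(r'(\d{1,2})/(\d{1,2})/(\d{1,4})')

def is_valid_date(m: re.Match):
    month = int(m.group(1))
    day = int(m.group(2))
    year = int(m.group(3))
    # Range checks only: this does not reject impossible dates such as 2/31.
    b1 = day > 0 and day <= 31
    b2 = month > 0 and month <= 12
    b3 = year > 0
    return b1 and b2 and b3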
It gives you:
s = "For the period from 7/1/22 through 8/31/22, and 13/2/24 is not a valid date"
dates = [m for m in pattern.finditer(s) if is_valid_date(m)]
>>> [<re.Match object; span=(20, 26), match='7/1/22'>, <re.Match object; span=(35, 42), match='8/31/22'>]
If you want to output a list of tuples:
dates_tp = [(m.group(1), m.group(2), m.group(3)) for m in dates]
>>> [('7', '1', '22'), ('8', '31', '22')]
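If the range checks are not strict enough, the validation function can hand the work to datetime.strptime, which also rejects impossible dates such as 2/31. This is only a sketch of that idea, not part of the method above: parse_date is a hypothetical helper, and the choice to accept only two-digit (%y) and four-digit (%Y) years is my assumption.

```python
import re
from datetime import datetime

pattern = re.compile(r'(\d{1,2})/(\d{1,2})/(\d{1,4})')

def parse_date(m):
    """Return a datetime for a real calendar date, else None (hypothetical helper)."""
    # Assumption: 2-digit years use %y, 4-digit years use %Y, other lengths are rejected.
    fmt = {2: "%m/%d/%y", 4: "%m/%d/%Y"}.get(len(m.group(3)))
    if fmt is None:
        return None
    try:
        return datetime.strptime(m.group(0), fmt)
    except ValueError:
        # Raised for an invalid month or day, including days that do not exist in that month.
        return None

s = "From 7/1/22 through 8/31/22, but not 13/2/24 or 2/31/22"
valid = [m.group(0) for m in pattern.finditer(s) if parse_date(m) is not None]
print(valid)  # ['7/1/22', '8/31/22']
```

The regex stays simple, and the calendar rules live in the standard library instead of in the pattern.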
Edit:
Actually, it looks like this method is also slightly faster:
pat_1 = r"(?:^|[^0-9])(1[012]|0?[1-9])[/-](1[0-9]|2[0-9]|3[01]|0?[1-9])[/-](\d{2})(?:$| )"

def method_1(s):
    # Strict regex from the question; flags=18 is re.IGNORECASE | re.DOTALL.
    res = re.findall(pat_1, s, flags=18)
    return res

pat_2 = re.compile(r'(\d{1,2})/(\d{1,2})/(\d{1,4})')

def is_valid_date(m: re.Match):
    month = int(m.group(1))
    day = int(m.group(2))
    year = int(m.group(3))
    b1 = day > 0 and day <= 31
    b2 = month > 0 and month <= 12
    b3 = year > 0
    return b1 and b2 and b3

def method_2(s):
    # Loose regex plus validation in Python.
    dates_tp = [(m.group(1), m.group(2), m.group(3)) for m in pat_2.finditer(s) if is_valid_date(m)]
    return dates_tp

import timeit
timeit.timeit("method_1(s)", globals=globals(), number=1000000)
>>> 5.62
timeit.timeit("method_2(s)", globals=globals(), number=1000000)
>>> 4.85