I have date strings of the following forms: '8 april 2022', '8 april' and 'april', and a regex that tries to match any of them:
re.findall(r"(\d{1,2})?.*(januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december).*(202\d)?", str)
The problem is that for str = '8 april 2022' it returns [('8', 'april', '')].
So my question is: why does ? ignore the one occurrence of 202\d when it is there?
Thank you.
EDIT. With the non-greedy .*?:
re.findall(r"(\d{1,2}).*?(januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december).*?(202\d)?", str)
it still doesn't capture 2022.
EDIT 2. Considering the answers, a better question would be: is there a way of saying 'hey regex, 1 occurrence is optional but preferable to 0'?
CodePudding user response:
.* should rarely be used because of its greediness. The .* after the month matches too much and leaves nothing for the 3rd capture group (the year) to match. Also, you only need to match a single space between the parts. It is important to make the part between month and year optional by using a non-capturing group, as shown below.
You may use this regex with optional non-capturing groups, a word boundary and a bit of tweaking:
\b(?:(\d{1,2}) )?(januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december)(?: (202\d))?
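As a quick check, the pattern above can be run against the three forms from the question. This is only a minimal sketch; the names MONTHS, PATTERN, samples and results are illustrative and not part of the answer.

```python
import re

# Dutch month names, exactly as listed in the question.
MONTHS = ("januari|februari|maart|april|mei|juni|juli|augustus|"
          "september|oktober|november|december")

# Optional "day + space", required month, optional "space + year".
PATTERN = re.compile(rf"\b(?:(\d{{1,2}}) )?({MONTHS})(?: (202\d))?")

samples = ["8 april 2022", "8 april", "april"]

# findall returns one tuple per match; groups that did not take part are ''.
results = {s: PATTERN.findall(s) for s in samples}

for s, found in results.items():
    print(f"{s!r} -> {found}")
```

Because the day and the year each sit inside their own optional non-capturing group, a missing part shows up as an empty string in the tuple instead of making the match fail.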
CodePudding user response:
The .* matches " 2022" and then the (202\d)? matches "", as it's optional and there's nothing left.
The .*? matches "" and then the (202\d)? matches "", as it's optional and the remaining " 2022" doesn't even start with 2.
You wish it would search further so that the (202\d)? matches the "2022", but why should it search further? It already found a match, so it stops and reports that.
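To see this concretely, one can look at what each pattern consumed overall. The following is a small sketch, assuming the greedy pattern from the question and the lazy variant from its first edit; the variable names are illustrative.

```python
import re

MONTHS = ("januari|februari|maart|april|mei|juni|juli|augustus|"
          "september|oktober|november|december")
text = "8 april 2022"

# Greedy pattern from the question and lazy pattern from its first edit.
greedy = re.search(rf"(\d{{1,2}})?.*({MONTHS}).*(202\d)?", text)
lazy = re.search(rf"(\d{{1,2}}).*?({MONTHS}).*?(202\d)?", text)

# group(0) is the whole matched text; group(3) is the optional year group,
# which is None when it did not take part in the match.
print(repr(greedy.group(0)), greedy.group(3))
print(repr(lazy.group(0)), lazy.group(3))
```

In the greedy case the whole string is matched, so the year was swallowed by the second .*; in the lazy case the match ends right after the month, so the year was never reached. In both cases the year group stays empty.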
CodePudding user response:
In the last part of your regex pattern, .*(202\d)?, the 2022 is consumed by the .* and consequently (202\d) captures nothing.
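Here is an alternative to consider, though it may not be exactly what you want. Every group is non-capturing, so findall returns the whole matched text rather than a tuple: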
matches = re.findall(r"(?:\d{0,2}\s*)(?:januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december)(?:\s202\d)?", str)
For '3 mei woensdag 2022', this may not be exactly what you want, but it should work for the year:
matches = re.findall(r"(?:\d{0,2}\s*)(?:\w+\s*)+(?:\s*202\d)?", str)
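One possible way to express "optional, but preferred when present" (the question's second edit) is to move the lazy filler inside the optional group, so the engine first tries to reach a year and only gives up on it when none can be found. This is a sketch and not taken from any of the answers; the names MONTHS, PATTERN and parse are hypothetical.

```python
import re

MONTHS = ("januari|februari|maart|april|mei|juni|juli|augustus|"
          "september|oktober|november|december")

# The lazy filler .*? lives INSIDE the optional year group. The trailing ?
# is greedy, so the group is attempted first (filler plus year) and is
# skipped only if no 202x year follows anywhere later in the string.
PATTERN = re.compile(rf"\b(?:(\d{{1,2}})\s+)?({MONTHS})(?:.*?(202\d))?")

def parse(text):
    """Return (day, month, year) tuples; missing parts are ''."""
    return PATTERN.findall(text)

for sample in ["8 april 2022", "3 mei woensdag 2022", "8 april", "april"]:
    print(f"{sample!r} -> {parse(sample)}")
```

Note that the filler can cross arbitrary text, so in a longer string containing several dates it could pick up a year that belongs to a later date. Restricting the filler (for example to a few words) would be a reasonable refinement.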