Python regex doesn't match 1 occurrence with the "0 or 1 occurrences" operator?

Time:08-21

I have date strings of the following forms: '8 april 2022', '8 april', 'april', and a regex that tries to match any of them:

re.findall(r"(\d{1,2})?.*(januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december).*(202\d)?", str)

The problem is that it returns ('8', 'april', '') for str = '8 april 2022'. So my question is: why does ? ignore the one occurrence of 202\d when it's there? Thank you.

EDIT. With non greedy .*?

re.findall(r"(\d{1,2}).*?(januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december).*?(202\d)?", str)

it still doesn't capture 2022.

EDIT 2. Considering the answers, a better question would be: is there a way of saying 'hey regex, 1 occurrence is optional but preferable to 0'?

CodePudding user response:

.* should rarely be used because of its greediness: the .* after the month matches too much and leaves nothing for the third capture group (the year) to match. You also only need to match 1+ spaces between the parts. It is important to make the part between the month and the year optional by wrapping it in a non-capturing group, as shown below.

You may use this regex with optional non-capturing groups, a word boundary and a bit of tweaking:

\b(?:(\d{1,2}) +)?(januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december)(?: +(202\d))?
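As a rough check of this approach, the sketch below runs the pattern against the three input forms from the question. It assumes the separators are meant to be one or more spaces (` +`); the names `MONTHS`, `DATE_RE` and `find_dates` are made up for the illustration.

```python
import re

# Dutch month names from the question, joined into one alternation.
MONTHS = ("januari|februari|maart|april|mei|juni|juli|augustus|"
          "september|oktober|november|december")

# Assumption: day and year are each separated from the month by 1+ spaces.
# The day and the year live in optional non-capturing groups, so the
# separator is only required when the optional part is actually present.
DATE_RE = re.compile(r"\b(?:(\d{1,2}) +)?(" + MONTHS + r")(?: +(202\d))?")


def find_dates(text):
    """Return a list of (day, month, year) tuples; missing parts are ''."""
    return DATE_RE.findall(text)


for sample in ("8 april 2022", "8 april", "april"):
    print(sample, "->", find_dates(sample))
```

Because there is no `.*` left to swallow the year, the optional year group gets the first chance to match right after the month.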


CodePudding user response:

The .* matches " 2022" and then the (202\d)? matches "", as it's optional and there's nothing left.

The .*? matches "" and then the (202\d)? matches "", as it's optional and the remaining " 2022" doesn't even start with 2.

You wish it would search further so that the (202\d)? matches the "2022", but why should it search further? It already found a match, so it stops and reports that.
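The behaviour described above can be observed directly by looking at where each match ends. This is a small sketch with the month list shortened to `april` for readability; the variable names are only illustrative.

```python
import re

text = "8 april 2022"

# Greedy version: the trailing .* runs to the end of the string,
# so the optional (202\d)? is left with nothing and matches empty.
greedy = re.search(r"(\d{1,2})?.*(april).*(202\d)?", text)

# Lazy version: the trailing .*? matches nothing, and (202\d)? is tried
# at the space before "2022", fails there, and is simply skipped.
lazy = re.search(r"(\d{1,2}).*?(april).*?(202\d)?", text)

greedy_groups, greedy_span = greedy.groups(), greedy.span()
lazy_groups, lazy_span = lazy.groups(), lazy.span()

print(greedy_groups, greedy_span)  # year group is None, match covers the whole string
print(lazy_groups, lazy_span)      # year group is None, match stops right after "april"
```

In both cases the engine has already found a valid overall match, so it never backtracks to give the optional group a second chance. `re.findall` reports the unmatched group as `''` instead of `None`.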

CodePudding user response:

In the last part of your regex pattern, .*(202\d)?, the 2022 is consumed by the .* and consequently (202\d) captures nothing.

This is for your perusal, but may not be exactly as you wanted.

matches = re.findall(r"(?:\d{0,2}\s*)(?:januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december)(?:\s202\d)?", str)

For 3 mei woensdag 2022, this may not be what you wanted exactly but it should work for the year:

matches = re.findall(r"(?:\d{0,2}\s*)(?:\w+\s*)+(?:\s*202\d)?", str)
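One possible way to express "optional, but preferred when present" (the EDIT 2 question) is to move the filler inside the optional group, so that the filler and the year succeed or fail together. This is a sketch, not taken from the answers above; `MONTHS`, `DATE_RE` and `parse` are hypothetical names.

```python
import re

MONTHS = ("januari|februari|maart|april|mei|juni|juli|augustus|"
          "september|oktober|november|december")

# The lazy filler .*? sits INSIDE the optional group together with the year.
# A greedy ? tries the group first, so the engine keeps extending .*? until
# 202\d matches, and only skips the whole group if no year follows.
DATE_RE = re.compile(r"(?:(\d{1,2})\s+)?(" + MONTHS + r")(?:.*?(202\d))?")


def parse(text):
    """Return (day, month, year) for the first date found, or None."""
    m = DATE_RE.search(text)
    return m.groups() if m else None


for sample in ("3 mei woensdag 2022", "8 april 2022", "8 april", "april"):
    print(sample, "->", parse(sample))
```

Note that the filler will skip over anything on the same line to reach a year, so for text containing several dates a stricter filler (for example a bounded number of words) may be a safer choice.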