Regex to extract month and year combinations from a date range

I am using a regex to extract the month and year from pairs of dates (date ranges) in text:

regex = (
    r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
    r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})?\s?\s?((to)|[\|\-\–\—])\s?\s?"
    r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
    r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})|(Present|Now|till\s?(now|date|today)?|current)))"
)

When I test the regex with inputs where some dates include the day of the month and others do not:

import re

lst = [
    'July 2014 - 28th August 2014',
    'Jan 2012 - 3rd sep 2014',
    'Jan 2008 - May 2012',
    'Jan 2008 and May 2012'
]
for i in lst:
    word = re.finditer(regex,i,re.IGNORECASE)
    for match in word:
        print(match.group())

I get the following output:

Jan 2008 - May 2012

but my expected output is:

July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012

What do I need to change to make the regex match text with an optional day in the date? When a date string includes the day, it is always an ordinal number with a st, nd, rd or th suffix.

Answer:

You cannot "skip" part of a string during a single match operation, so if you have 26th August, you can't match or capture just 26 August. In these cases, you either need to capture parts of the match and then concatenate them, or replace the parts you do not need as a post-processing step.

Here, I'd use the post-processing approach: match the whole date range first, then strip the ordinal day from the matched text:

import re

# Optional day of the month (1-31) with an optional ordinal suffix
day = r'(?:((?:0?[1-9]|[12]\d|3[01])(?:\s*(?:st|[rn]d|th))?)\s*)?'
month = r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
year = r'(\d{2}(?:\d{2})?)'
# Matches "[day] month year - [day] month year" (hyphen, en dash or em dash)
rx_valid = re.compile(fr'\b{day}{month}\s*{year}\s*[-—–]\s*{day}{month}\s*{year}(?!\d)', re.IGNORECASE)
# Matches an ordinal day such as " 28th" so it can be removed from a match
rx_ordinal = re.compile(r'\s*\d+\s*(?:st|[rn]d|th)', re.IGNORECASE)

lst = [
    'July 2014 - 28th August 2014',
    'Jan 2012 - 3rd sep 2014',
    'Jan 2008 - May 2012',
    'Jan 2008 and May 2012'
]
for i in lst:
    word = rx_valid.finditer(i)
    for match in word:
        print(rx_ordinal.sub("", match.group()))

Output:

July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012
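
For comparison, the other option mentioned above (capture the parts you want and concatenate them) could look roughly like the sketch below. The names DAY, MONTH, YEAR, RX_RANGE and month_year_ranges are made up for this illustration. It assumes, as stated in the question, that a day always carries an ordinal suffix, and it always joins the two halves with " - " regardless of which dash appeared in the input.

```python
import re

# The day is matched but NOT captured, so it never reaches the output.
# Assumption: a day is always written with an ordinal suffix (st, nd, rd, th).
DAY = r'(?:(?:0?[1-9]|[12]\d|3[01])\s*(?:st|[rn]d|th)\s+)?'
MONTH = (r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?'
         r'|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')
YEAR = r'\d{2}(?:\d{2})?'

# Group 1 = first "month year", group 2 = second "month year".
RX_RANGE = re.compile(
    fr'\b{DAY}({MONTH}\s*{YEAR})\s*[-—–]\s*{DAY}({MONTH}\s*{YEAR})(?!\d)',
    re.IGNORECASE,
)


def month_year_ranges(text):
    """Return every 'month year - month year' range found in text."""
    # Concatenate the two captured halves; the separator is normalized to " - ".
    return [f'{m.group(1)} - {m.group(2)}' for m in RX_RANGE.finditer(text)]


for s in ['July 2014 - 28th August 2014', 'Jan 2012 - 3rd sep 2014',
          'Jan 2008 - May 2012', 'Jan 2008 and May 2012']:
    for r in month_year_ranges(s):
        print(r)
```

This avoids the second substitution pass, at the cost of not preserving the original separator and surrounding whitespace exactly.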
