Extracting substring with alternatives using regex in Python

I tried looking for previous posts but couldn't find anything that matches exactly what I'm looking for so here goes.

I'm trying to parse strings in a dataframe and capture a certain substring (the year) if a match is found. The formatting can vary a lot. I figured out an inelegant way to get it done, but I wonder if there is a better way.

The strings can look like this:

Random Text 31.12.2020
1.1. -31.12.2020
010120-311220
31.12.2020
1.1.2020-31.12.2020 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words

I'm looking to find the year, currently by finding the last date and its year. The current regex is .+3112(\d{2,4})|.+31\.12\.(\d{2,4}), where

it would return 20 in group 1 for 010120-311220, and 2020 in group 2 for 1.1.2020-31.12.2020 -.

The problem is that I cannot know beforehand which group the match will belong to: in the first example group 2 is None, and in the second example group 1 is None when using re.match(regexPattern, stringOfInterest). Therefore I can't access the value by naively calling .group(1) on the match object, as sometimes the value is in .group(2).
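To make the problem concrete, here is a minimal sketch that prints the groups for the two examples above. It assumes the pattern's alternatives begin with .+ (the prefix character is an assumption on my part); the variable names are only for illustration.

```python
import re

# Pattern from the question: two alternatives, each with its own capture group.
# The leading ".+" in each alternative is an assumption about the original regex.
pattern = re.compile(r'.+3112(\d{2,4})|.+31\.12\.(\d{2,4})')

for text in ['010120-311220', '1.1.2020-31.12.2020 -']:
    match = pattern.match(text)
    # groups() lists every capture group; the one that did not take part is None.
    print(text, '->', match.groups())

# 010120-311220 -> ('20', None)
# 1.1.2020-31.12.2020 - -> (None, '2020')
```

Both groups always exist on the match object; the one belonging to the alternative that did not match is simply None.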

The best I've come up with so far is naming the groups with (?P<groupName>\d{2,4}) and checking for None:

import re

def getYear(stringOfInterest):
    # Each alternative has its own named group; only one of them is set per match.
    regexPattern = r'(^|.+)3112(?P<firstMatchType>\d{2,4})|(^|.+)31\.12\.(?P<secondMatchType>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        matchDict = matchObject.groupdict()
        if matchDict['firstMatchType'] is not None:
            return matchDict['firstMatchType']
        else:
            return matchDict['secondMatchType']
    return None

df['year'] = df['text'].apply(getYear)

And while this works it intuitively seems like a stupid way to do it. Any ideas?

CodePudding user response：

We can try using re.findall here against your input list, with a regex alternation covering both variants:

import re

inp = ["Random Text 31.12.2020", "1.1. -31.12.2020", "010120-311220", "31.12.2020", "1.1.2020-31.12.2020 -", "1.1.2019 - 31.12.2019", "1.1. . . 31.12.2019 -", "1.1.2019 - -31.12.2019", "010120-311220 other random words"]
output = [re.findall(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})', x)[-1] for x in inp]
output = [x[0] if x[0] else x[1] for x in output]
print(output)  # ['2020', '2020', '20', '2020', '2020', '2019', '2019', '2019', '20']

The strategy here is to match either of the two date variants and retain the last match for each input. Because the pattern has two capture groups, re.findall returns one tuple per match, in which only the group belonging to the matched alternative is filled; the other is an empty string. The list comprehension then picks the non-empty value.
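One thing to watch: indexing with [-1] raises IndexError when a string contains no date at all. A minimal sketch of the same idea wrapped in a function that returns None instead; last_year and DATE_PATTERN are hypothetical names chosen for this example.

```python
import re

# Same alternation as above: dotted dates with a 4-digit year, or compact ddmmyy.
DATE_PATTERN = re.compile(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})')

def last_year(text):
    """Return the year of the last date-like match in text, or None."""
    matches = DATE_PATTERN.findall(text)
    if not matches:
        return None  # findall returns [] when nothing matches
    dotted, compact = matches[-1]
    # Exactly one of the two groups is non-empty for any given match.
    return dotted or compact

print(last_year('1.1.2019 - 31.12.2019'))  # 2019
print(last_year('010120-311220'))          # 20
print(last_year('no date here'))           # None
```

This could be applied to a dataframe column with .apply(last_year), in the same way as getYear in the question.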

CodePudding user response：

Your regex can be simplified considerably by grouping just the alternation at the beginning of the date; this removes the need to check two groups:

regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'

Once the group is extracted, it can be normalized into a proper four-digit year:

if matchObject is not None:
    return ('20' + matchObject.group('year'))[-4:]
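
The slicing trick can be checked in isolation. A minimal sketch, where normalize_year is a hypothetical helper name and the '20' prefix assumes every year falls in the 2000s:

```python
def normalize_year(year_digits):
    """Pad a 2-digit year to 4 digits; leave a 4-digit year unchanged."""
    # Prepend the assumed century, then keep only the last four characters.
    return ('20' + year_digits)[-4:]

print(normalize_year('20'))    # 2020
print(normalize_year('2019'))  # 2019
```

Note that \d{2,4} also accepts three digits, and a three-digit capture such as '202' would come out as '0202', so it may be worth restricting the pattern to two or four digits if such input can occur.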

All in all, we get:

import re

def getYear(stringOfInterest):
    # One named group covers both date variants.
    regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        # Pad a 2-digit year with '20'; a 4-digit year is left unchanged.
        return ('20' + matchObject.group('year'))[-4:]
    return None

df['year'] = df['text'].apply(getYear)

CodePudding user response：

This is my approach to your problem; maybe it will be useful:


import re
string = '''
Random Text 31.12.2020
1.1. -31.12.2022
010120-311220
31.12.2020
1.1.2020-31.12.2018 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words'''
pattern = r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})'
matches = re.findall(pattern, string)
print("1) ", matches)

# flatten the list of tuples into a single list of strings
match_array = [i for sub in matches for i in sub]
print(match_array)

# drop the empty strings left by the capture group that did not match
res = [element for element in match_array if element.strip()]
print(res)
print(res)
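
Because re.findall runs over the whole multi-line string here, the flattened list contains the year of every date, start dates included. A hedged sketch of the difference, with a per-line variant (my own addition, not part of the approach above) that keeps only the last year on each line:

```python
import re

pattern = r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})'
text = "1.1.2019 - 31.12.2019\n010120-311220"

# Whole-string search: one entry per date found anywhere in the text.
all_years = [g for match in re.findall(pattern, text) for g in match if g]
print(all_years)  # ['2019', '2019', '20', '20']

# Per-line search: keep only the last date's year on each line.
last_per_line = []
for line in text.splitlines():
    found = re.findall(pattern, line)
    if found:
        last_per_line.append(next(g for g in found[-1] if g))
print(last_per_line)  # ['2019', '20']
```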

CodePudding user response：

It looks like all your years are from the 21st century. In this case, all you need is

df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)

Regex details:

.* - any zero or more chars other than line break chars as many as possible
31\.?12\.? - 31, an optional ., 12, and an optional . char
(?:\d{2})? - an optional sequence of two digits
(\d{2}) - Group 1: two last digits of the year.
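
The breakdown above can be checked without pandas. A minimal sketch using only the re module; year_from is a hypothetical helper, and the '20' prefix carries the same 21st-century assumption:

```python
import re

# Same pattern as above, compiled once for reuse.
pattern = re.compile(r'.*31\.?12\.?(?:\d{2})?(\d{2})')

def year_from(text):
    """Return a four-digit year string, or None if no 31 December date is found."""
    match = pattern.match(text)
    # Group 1 always holds the last two digits of the year.
    return '20' + match.group(1) if match else None

print(year_from('010120-311220'))          # 2020
print(year_from('1.1.2019 - 31.12.2019'))  # 2019
print(year_from('Random Text'))            # None
```

For the compact form 311220, the optional (?:\d{2})? first consumes "20", the final (\d{2}) then fails, and the engine backtracks so that the optional group matches nothing and Group 1 captures "20".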

See a Pandas test:

import pandas as pd
df = pd.DataFrame({'text': ['Random Text 31.12.2020','1.1. -31.12.2020','010120-311220','31.12.2020','1.1.2020-31.12.2020 -','1.1.2019 - 31.12.2019','1.1. . . 31.12.2019 -','1.1.2019 - -31.12.2019','010120-311220 other random words']})
df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)

Output:

>>> df
                               text  year
0            Random Text 31.12.2020  2020
1                  1.1. -31.12.2020  2020
2                     010120-311220  2020
3                        31.12.2020  2020
4             1.1.2020-31.12.2020 -  2020
5             1.1.2019 - 31.12.2019  2019
6             1.1. . . 31.12.2019 -  2019
7            1.1.2019 - -31.12.2019  2019
8  010120-311220 other random words  2020