I tried looking for previous posts but couldn't find anything that matches exactly what I'm looking for, so here goes.
I'm trying to parse strings in a dataframe and capture a certain substring (the year) if a match is found. The formatting can vary a lot. I figured out an inelegant way to get it done, but I wonder if there is a better way.
Strings can look like this:
Random Text 31.12.2020
1.1. -31.12.2020
010120-311220
31.12.2020
1.1.2020-31.12.2020 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words
I'm looking to find the year, currently by finding the last date and its year.
The current regex is .+3112(\d{2,4})|.+31\.12\.(\d{2,4}), where it would return 20 in group 1 for 010120-311220, and it would return 2020 in group 2 for 1.1.2020-31.12.2020 -.
The problem is that I cannot know beforehand which group the match will belong to: with re.match(regexPattern, stringOfInterest), group 2 is None in the first example and group 1 is None in the second. Therefore I couldn't access the value by naively using .group(1) on the match object, as sometimes the value would be in .group(2).
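To make the problem concrete, here is a minimal sketch of one generic workaround: take whichever capture group is not None. It assumes the leading part of the pattern is .+, and first_matched_group is a hypothetical helper name, not something from the re module.

```python
import re

# Two alternatives, each with its own capture group for the year.
PATTERN = re.compile(r'.+3112(\d{2,4})|.+31\.12\.(\d{2,4})')

def first_matched_group(match):
    """Return the first capture group that is not None, or None if there is no match."""
    if match is None:
        return None
    return next((g for g in match.groups() if g is not None), None)

for text in ['010120-311220', '1.1.2020-31.12.2020 -']:
    m = PATTERN.match(text)
    # groups() shows which branch fired; the helper hides that detail.
    print(m.groups(), first_matched_group(m))
```

This only works cleanly when every group in the pattern is a year group; any extra capturing group (such as a capturing (^|.+) prefix) would have to be made non-capturing first.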
The best I've come up with so far is naming the groups with (?P<groupName>\d{2,4}) and checking for None:
import re

def getYear(stringOfInterest):
    # One named group per date format; only one of them is set on a match.
    regexPattern = r'(^|.+)3112(?P<firstMatchType>\d{2,4})|(^|.+)31\.12\.(?P<secondMatchType>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        matchDict = matchObject.groupdict()
        if matchDict['firstMatchType'] is not None:
            return matchDict['firstMatchType']
        else:
            return matchDict['secondMatchType']
    return None

df['year'] = df['text'].apply(getYear)
And while this works it intuitively seems like a stupid way to do it. Any ideas?
CodePudding user response:
We can try using re.findall here against your input list, with a regex alternation covering both variants:
import re

inp = ["Random Text 31.12.2020", "1.1. -31.12.2020", "010120-311220", "31.12.2020", "1.1.2020-31.12.2020 -", "1.1.2019 - 31.12.2019", "1.1. . . 31.12.2019 -", "1.1.2019 - -31.12.2019", "010120-311220 other random words"]
output = [re.findall(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})', x)[-1] for x in inp]
output = [x[0] if x[0] else x[1] for x in output]
print(output) # ['2020', '2020', '20', '2020', '2020', '2019', '2019', '2019', '20']
The strategy here is to match either of the two date variants and retain the last match for each input. Then we use a list comprehension to pick the non-empty value. Because the two capture groups sit in different branches of the alternation, exactly one of them is non-empty for any given match.
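The same strategy can be wrapped in a function that also copes with strings containing no date at all, where indexing the findall result with [-1] would raise an IndexError. This is only a sketch; last_year and DATE_PATTERN are hypothetical names.

```python
import re

DATE_PATTERN = re.compile(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})')

def last_year(text):
    """Return the year of the last date found in text, or None if there is none."""
    matches = DATE_PATTERN.findall(text)
    if not matches:
        return None
    # Each match is a (dotted, compact) tuple; one of the two is an empty string.
    dotted, compact = matches[-1]
    return dotted or compact

print(last_year('1.1.2019 - 31.12.2019'))  # 2019
print(last_year('010120-311220'))          # 20
print(last_year('no dates here'))          # None
```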
CodePudding user response:
Your regex can be factorized a lot by grouping just the alternation of the beginning of the date; this removes the need to check for two groups:
regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'
Once the group is extracted, it can be normalized into a proper four-digit year:
if matchObject is not None:
    return ('20' + matchObject.group('year'))[-4:]
All in all, we get:
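import re

def getYear(stringOfInterest):
    # A single named group covers both the 3112YY and 31.12.YYYY forms.
    regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        # Prefix '20' and keep the last four characters: '20' -> '2020', '2019' -> '2019'.
        return ('20' + matchObject.group('year'))[-4:]
    return None

df['year'] = df['text'].apply(getYear)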
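The normalization step is easy to check in isolation. A small sketch, with normalize_year as a hypothetical helper name, assuming every two-digit year belongs to the 2000s:

```python
def normalize_year(year):
    """Expand a two-digit year to four digits, assuming it falls in the 2000s."""
    # Prepending '20' and slicing the last four characters leaves
    # four-digit years untouched and pads two-digit ones.
    return ('20' + year)[-4:]

print(normalize_year('20'))    # 2020
print(normalize_year('2019'))  # 2019
```

Note that \d{2,4} can also capture three digits, and a capture such as '202' would come out as '0202'; if that matters, restricting the group to \d{2}(?:\d{2})? may be safer.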
CodePudding user response:
This is my approach to your problem; maybe it will be useful:
import re
string = '''
Random Text 31.12.2020
1.1. -31.12.2022
010120-311220
31.12.2020
1.1.2020-31.12.2018 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words'''
pattern = r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})'
matches = re.findall(pattern, string)
print("1) ", matches)
# Flatten the list of (group 1, group 2) tuples into a single list
match_array = [i for sub in matches for i in sub]
print(match_array)
# Drop the empty strings left by whichever group did not match
res = [element for element in match_array if element.strip()]
print(res)
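The flatten-then-filter steps can probably be collapsed into one comprehension, since each tuple returned by re.findall has exactly one non-empty element. A sketch on a shortened two-line input (the variable names are illustrative only):

```python
import re

pattern = r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})'
text = '1.1.2019 - 31.12.2019\n010120-311220'

# Each findall item is a (four_digit, two_digit) tuple with one empty string,
# so "or" picks whichever side actually matched.
years = [four_digit or two_digit for four_digit, two_digit in re.findall(pattern, text)]
print(years)  # ['2019', '2019', '20', '20']
```

Like the code above, this collects every date in the text, not just the last date of each line.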
CodePudding user response:
It looks like all your years are from the 21st century. In this case, all you need is
df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)
See the regex demo. Details:
.* - any zero or more chars other than line break chars, as many as possible
31\.?12\.? - 31, an optional . char, 12, and an optional . char
(?:\d{2})? - an optional sequence of two digits
(\d{2}) - Group 1: the last two digits of the year.
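The pattern can also be tried without pandas. The sketch below uses re.search from the standard library; extract_year and YEAR_PATTERN are hypothetical names, and the '20' prefix again assumes every year is in the 2000s.

```python
import re

YEAR_PATTERN = re.compile(r'.*31\.?12\.?(?:\d{2})?(\d{2})')

def extract_year(text):
    """Return a four-digit year string, or None when no 31.12 date is found."""
    m = YEAR_PATTERN.search(text)
    return ('20' + m.group(1)) if m else None

for sample in ['1.1.2019 - -31.12.2019', '010120-311220 other random words']:
    print(sample, '->', extract_year(sample))
```

For the two samples this gives 2019 and 2020 respectively.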
See a Pandas test:
import pandas as pd
df = pd.DataFrame({'text': ['Random Text 31.12.2020','1.1. -31.12.2020','010120-311220','31.12.2020','1.1.2020-31.12.2020 -','1.1.2019 - 31.12.2019','1.1. . . 31.12.2019 -','1.1.2019 - -31.12.2019','010120-311220 other random words']})
df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)
Output:
>>> df
                               text  year
0            Random Text 31.12.2020  2020
1                  1.1. -31.12.2020  2020
2                     010120-311220  2020
3                        31.12.2020  2020
4             1.1.2020-31.12.2020 -  2020
5             1.1.2019 - 31.12.2019  2019
6             1.1. . . 31.12.2019 -  2019
7            1.1.2019 - -31.12.2019  2019
8  010120-311220 other random words  2020