import datetime
import re

input_text_substring = "durante el transcurso del mes de diciembre de 2350"  # example 1
#input_text_substring = "durante el transcurso del mes de diciembre del año 2350"  # example 2
#input_text_substring = "durante el transcurso del mes 12 2350"  # example 3

# If the text contains neither "del año" followed by digits (any number of them)
# nor a 4-digit year at the very end (whatever comes before it), append the current year.
if not re.search(r"(?:(?:del|de[\s|]*el|el)[\s|]*(?:año|ano)[\s|]*\d*|.*\d{4}$)", input_text_substring):
    # For when no phrase indicating context (for example "del año") was given and the number of digits is not 4
    input_text_substring += " de " + datetime.datetime.today().strftime('%Y') + " "

# A month number or some other text without ";" and without a "." followed by a newline
# (it must also allow nothing at all, or only whitespace, in the middle)
some_text = r"(?:(?!\.\s*?\n)[^;])*"

# We need to capture the group at the position of the last \d*
m1 = re.search(r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})" + some_text + r"(?P<year>\d*)", str(input_text_substring), re.IGNORECASE)
#if m1: identified_year = str(m1.groups()["\g<year>"])
if m1:
    identified_year = str(m1.groups()[0])
input_text_substring = re.sub(r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})" + some_text + r"\d*", identified_year, input_text_substring)
print(repr(identified_year))
print(repr(input_text_substring))
This is the wrong output that I get with this code (tested with example 1):
''
'durante el transcurso '
And this is the correct output that I need:
'2350' #in example 1, 2 and 3
'durante el transcurso del mes de diciembre 2350' #in example 1 and 2
'durante el transcurso del mes 12 2350' #in example 3
Why can't I capture the numeric value of the year, (?P<year>\d*), using the capture group references m1.groups()["\g<year>"] or m1.groups()[0]?
CodePudding user response:
The (?P<year>\d*) group ends up empty because the preceding pattern has already consumed the year: [^;] also matches digits, and the greedy * runs all the way to the end of the string.
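To make the greedy behaviour visible, here is a minimal sketch that runs the question's own pattern on example 1 and prints both the whole match and the named group. The variable names `text`, `pattern` and `m` are only illustrative.

```python
import re

text = "durante el transcurso del mes de diciembre de 2350"

# Pattern exactly as in the question: [^;] also accepts digits
some_text = r"(?:(?!\.\s*?\n)[^;])*"
pattern = (
    r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})"
    + some_text
    + r"(?P<year>\d*)"
)

m = re.search(pattern, text, re.IGNORECASE)

# The greedy middle part swallows everything up to the end of the string...
print(repr(m.group(0)))       # 'del mes de diciembre de 2350'
# ...so \d* is left with nothing and matches the empty string.
print(repr(m.group("year")))  # ''
```

Because \d* is allowed to match zero digits, the regex engine never needs to backtrack and hand the digits back to the year group.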
One way to keep the previous pattern from consuming the year is to extend the negative look-ahead as follows:
some_text = r"(?:(?!\.\s*?\n|\d{4})[^;])*"
#                           ^^^^^^
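As a quick check, the sketch below applies the extended look-ahead to the three examples from the question. The helper `extract_year` is a hypothetical name introduced here for illustration. It reads the group with `Match.group("year")`, because `m1.groups()` returns a plain tuple that only accepts integer indexes, and `\g<year>` is syntax for `re.sub` replacement strings rather than a lookup key.

```python
import re

# Middle part with the extended negative look-ahead: stop before a 4-digit run
some_text = r"(?:(?!\.\s*?\n|\d{4})[^;])*"
pattern = re.compile(
    r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})"
    + some_text
    + r"(?P<year>\d*)",
    re.IGNORECASE,
)

examples = [
    "durante el transcurso del mes de diciembre de 2350",
    "durante el transcurso del mes de diciembre del año 2350",
    "durante el transcurso del mes 12 2350",
]


def extract_year(text):
    """Hypothetical helper: return the captured year, or None if nothing matches."""
    m = pattern.search(text)
    return m.group("year") if m else None


years = [extract_year(t) for t in examples]
print(years)  # ['2350', '2350', '2350']
```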
In the expected results you keep "del mes..." in the final value of input_text_substring. If that is what you want, just don't remove that part of the string: drop the last re.sub call. But maybe you overlooked this in your question?
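If the goal is to reproduce the expected strings exactly (keeping "del mes ..." but dropping the connector "de" or "del año" before the year), one possible approach is to capture the prefix in its own group and reuse it in the replacement. The pattern below is my own assumption, not code from the question: the group name `head`, the lazy quantifier and the optional connector part are all illustrative choices.

```python
import re

pattern = re.compile(
    # Prefix to keep: the "mes" phrase plus the shortest possible following text
    r"(?P<head>(?:del\s+mes|de\s+el\s+mes|de\s+mes|\d{2})(?:(?!\.\s*?\n)[^;])*?)"
    # Optional connector to drop, e.g. "de", "del año", "de el ano"
    r"\s+(?:(?:del|de\s+el|el|de)(?:\s+a[ñn]o)?\s+)?"
    # The 4-digit year
    r"(?P<year>\d{4})",
    re.IGNORECASE,
)

examples = [
    "durante el transcurso del mes de diciembre de 2350",
    "durante el transcurso del mes de diciembre del año 2350",
    "durante el transcurso del mes 12 2350",
]

# \g<name> refers to a named group inside a replacement string
cleaned = [pattern.sub(r"\g<head> \g<year>", t) for t in examples]
for line in cleaned:
    print(repr(line))
# 'durante el transcurso del mes de diciembre 2350'
# 'durante el transcurso del mes de diciembre 2350'
# 'durante el transcurso del mes 12 2350'
```

This also shows where `\g<year>` does belong: in the replacement template of `re.sub`, not as a key into `m1.groups()`.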
Finally, [\s|]* is not really what you want: inside a character class | is a literal character, so it would match a literal | in your input. Moreover, you seem to want to match at least one whitespace character. So replace these occurrences with \s+.
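A small comparison of the two spellings, using made-up sample strings, shows the difference. The names `loose` and `strict` are only labels for this sketch.

```python
import re

loose = re.compile(r"del[\s|]*mes")  # zero or more of: whitespace OR a literal "|"
strict = re.compile(r"del\s+mes")    # one or more whitespace characters

samples = ["del mes", "del|mes", "delmes"]
results = {
    s: (bool(loose.fullmatch(s)), bool(strict.fullmatch(s))) for s in samples
}

for sample, (is_loose, is_strict) in results.items():
    print(f"{sample!r}: loose={is_loose}, strict={is_strict}")
# 'del mes': loose=True, strict=True
# 'del|mes': loose=True, strict=False
# 'delmes': loose=True, strict=False
```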