Why does capturing the capture group identified with this regex search pattern fail?

import datetime
import re

input_text_substring = "durante el transcurso del mes de diciembre de 2350" #example 1
#input_text_substring = "durante el transcurso del mes de diciembre del año 2350" #example 2
#input_text_substring = "durante el transcurso del mes 12 2350" #example 3

##If it is NOT "del año" + "(it doesn't matter how many digits)" or if it is NOT "(it doesn't matter what comes before it)" + "(year of 4 digits)"
if not re.search(r"(?:(?:del|de[\s|]*el|el)[\s|]*(?:año|ano)[\s|]*\d*|.*\d{4}$)", input_text_substring):
    input_text_substring += " de " + datetime.datetime.today().strftime('%Y') + " "

#For when no previous phrase indicative of context was indicated, for example "del año" and the number of digits is not 4

some_text = r"(?:(?!\.\s*?\n)[^;])*" #a month number or some other text without dots (.), semicolons (;) or \n (although it must also admit the case where there is nothing in the middle, or only a whitespace)

#we need to capture the group in the position of the last \d*
m1 = re.search( r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})" + some_text + r"(?P<year>\d*)" , str(input_text_substring), re.IGNORECASE, )
#if m1: identified_year = str(m1.groups()["\g<year>"])
if m1: identified_year = str(m1.groups()[0])

input_text_substring = re.sub( r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})" + some_text + r"\d*", identified_year, input_text_substring )


print(repr(identified_year))
print(repr(input_text_substring))

This is the wrong output that I get with this code (tested with example 1):

''
'durante el transcurso '

And this is the correct output that I need:

'2350' #in example 1, 2 and 3
'durante el transcurso del mes de diciembre 2350' #in example 1 and 2
'durante el transcurso del mes 12 2350' #in example 3

Why can't I capture the numeric value of the years (?P<year>\d*) using the capture group references with m1.groups()["\g<year>"] or m1.groups()[0] ?

CodePudding user response：

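The <year> group does match, but only an empty string, because the preceding some_text pattern has already consumed the year: [^;] also matches digits, and the * is greedy. Since \d* is allowed to match nothing, the regex engine has no reason to backtrack and give those digits back.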
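To see this for yourself, here is a minimal sketch using the pattern from the question (assuming the pieces are joined with +, and using example 1 as input). It prints the whole match next to the named group. Note also that m1.groups() returns a plain tuple, so it cannot be indexed with "\g<year>"; a named group is read with m1.group("year") or m1["year"].

```python
import re

# Pattern pieces copied from the question (assumed to be concatenated with "+").
some_text = r"(?:(?!\.\s*?\n)[^;])*"
pattern = (
    r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})"
    + some_text
    + r"(?P<year>\d*)"
)

text = "durante el transcurso del mes de diciembre de 2350"  # example 1
m = re.search(pattern, text, re.IGNORECASE)

whole_match = m.group(0)   # everything the full pattern consumed
year = m.group("year")     # named group: use group(name), not groups()[name]

print(repr(whole_match))   # 'del mes de diciembre de 2350'
print(repr(year))          # ''
```

The whole match already ends with 2350, so nothing is left for the year group.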

One way to stop the previous pattern from consuming the year is to extend the negative lookahead as follows:

some_text = r"(?:(?!\.\s*?\n|\d{4})[^;])*"
#                           ^^^^^^
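As a quick check, the sketch below runs the extended some_text against the three example inputs from the question. The examples list and the years variable are just names chosen for this illustration, and the rest of the pattern is left as in the question.

```python
import re

# some_text with the extended negative lookahead: stop before any run of 4 digits.
some_text = r"(?:(?!\.\s*?\n|\d{4})[^;])*"
pattern = (
    r"(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|\d{2})"
    + some_text
    + r"(?P<year>\d*)"
)

examples = [
    "durante el transcurso del mes de diciembre de 2350",       # example 1
    "durante el transcurso del mes de diciembre del año 2350",  # example 2
    "durante el transcurso del mes 12 2350",                    # example 3
]

years = []
for text in examples:
    m = re.search(pattern, text, re.IGNORECASE)
    years.append(m.group("year") if m else None)

print(years)  # ['2350', '2350', '2350']
```

Keep in mind that the lookahead stops at the first run of four digits, so this assumes no other four-digit number appears between the month phrase and the year.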

Your expected results keep "del mes..." in the final value of input_text_substring. If that is what you want, then don't replace that part of the string in the last call of re.sub: just remove that statement. But maybe you overlooked this in your question?

Finally, [\s|]* is not really what you want: inside a character class, | is not an alternation operator, so it would match a literal | in your input. Moreover, you seem to want to match at least one white space character. So replace these occurrences with \s+.
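A small comparison of the two spellings, as a sketch: the sample strings and the loose/strict names are made up for this illustration.

```python
import re

loose = re.compile(r"del[\s|]*mes")  # class holds whitespace AND a literal "|"; "*" allows zero
strict = re.compile(r"del\s+mes")    # one or more whitespace characters, nothing else

samples = ["del mes", "del|mes", "delmes", "del   mes"]

loose_hits = [s for s in samples if loose.fullmatch(s)]
strict_hits = [s for s in samples if strict.fullmatch(s)]

print(loose_hits)   # ['del mes', 'del|mes', 'delmes', 'del   mes']
print(strict_hits)  # ['del mes', 'del   mes']
```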