This is my code, with some example inputs that simulate the environment in which this program will run:
import re, datetime
#Example input cases
input_text = '[26 -- 31] de 10 del 200 de 2022' #example 1
input_text = '[26 -- 31] de 12 del 206 del 2022' #example 2
input_text = '[06 -- 11] del 09 del ano 2020 del 2022' #example 3
input_text = '[06 -- 06] del mes 09 del ano 20 del ano 2022' #example 4
input_text = '[16 -- 06] del mes 09 del 2022' #example 5 (must not be modified)
possible_year_num = r"\d+" #one or more numeric digits (never zero digits)
current_year = datetime.datetime.today().strftime('%Y')
month_context_regex = r"[\s|]*(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|del|de[\s|]*el|de)[\s|]*"
year_context_regex = r"[\s|]*(?:del[\s|]*año|de[\s|]*el[\s|]*año|de[\s|]*año|del[\s|]*ano|de[\s|]*el[\s|]*ano|de[\s|]*ano|del|de[\s|]*el|de)[\s|]*"
#I combine these modular sub-patterns to build a regex that identifies the substring in which the replacement must be performed
identity_replacement_case_regex = r"\[\d{2}" + " -- " + r"\d{2}]" + month_context_regex + r"\d{2}" + year_context_regex + possible_year_num + year_context_regex + current_year
#Only in these cases do I need to replace with re.sub() and obtain, for example 1, the output string '[26 -- 31] de 10 del 200'
replacement_without_current_year = r"\[\d{2}" + " -- " + r"\d{2}]" + month_context_regex + r"\d{2}" + year_context_regex + possible_year_num
input_text = re.sub(identity_replacement_case_regex, replacement_without_current_year, input_text)
print(repr(input_text)) # --> output
The correct outputs should look like this:
'[26 -- 31] de 10 del 200' #for example 1
'[26 -- 31] de 12 del 206' #for example 2
'[06 -- 11] del 09 del ano 2020' #for example 3
'[06 -- 06] del mes 09 del ano 20' #for example 4
'[16 -- 06] del mes 09 del 2022' #for example 5 (not modified)
How should I write this replacement in the re.sub() function to get these outputs?
I get this error when I try this replacement:
Traceback (most recent call last):
input_text_substring = re.sub(identity_replacement_case_regex, replacement_without_current_year, input_text_substring)
raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \d at position 2
CodePudding user response:
Rule:
\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022
Demo: https://regex101.com/r/NRUEYO/1
Code:
import re
regex = r"\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022"
input_text = '[26 -- 31] de 10 del 200 de 2022' #example 1
input_text = '[26 -- 31] de 12 del 206 del 2022' #example 2
input_text = '[06 -- 11] del 09 del ano 2020 del 2022' #example 3
input_text = '[06 -- 06] del mes 09 del ano 20 del ano 2022' #example 4
input_text = '[16 -- 06] del mes 09 del 2022' #example 5 (must not be modified)
replace_text = ""
result = re.sub(regex, replace_text, input_text)
if result:
    print(result)
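Since every assignment to input_text overwrites the previous one, a snippet written that way only exercises example 5. Here is a minimal sketch that runs all five examples through the same rule. It assumes the rule starts with \D+ (one or more non-digit characters), so that the whole connector word before 2022 is consumed. The names REGEX, strip_current_year and examples are introduced here purely for illustration.

```python
import re

# Assumed form of the rule: \D+ consumes the connector words before 2022,
# and the two negative lookbehinds protect "<two digits> <2-3 letter word> 2022".
REGEX = r"\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022"


def strip_current_year(text):
    """Hypothetical helper: drop a trailing ' de/del [ano] 2022' when the rule matches."""
    return re.sub(REGEX, "", text)


examples = [
    '[26 -- 31] de 10 del 200 de 2022',               # example 1
    '[26 -- 31] de 12 del 206 del 2022',              # example 2
    '[06 -- 11] del 09 del ano 2020 del 2022',        # example 3
    '[06 -- 06] del mes 09 del ano 20 del ano 2022',  # example 4
    '[16 -- 06] del mes 09 del 2022',                 # example 5 (must not be modified)
]

for text in examples:
    print(repr(strip_current_year(text)))
```

Note that the year 2022 is hard-coded in this rule, so it would need to be rebuilt for a different current year.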
\D => Any non-digit character
\d => Any digit character
\d{2} => Two digit characters
\S => Any non-whitespace character
\S{3} => Three non-whitespace characters
(?<!A)2022 => 2022 must not be immediately preceded by an "A" character
(?<!\D\d{2} \S{3} )2022 => 2022 must not be preceded by a two-digit number followed by a three-character word
(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022 => 2022 must not be preceded by a two-digit number followed by a three- or two-character word
\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022 => Match all the non-digit characters immediately before a 2022 that satisfies (?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022, so that re.sub() removes them together with the 2022
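As a side note on the original error: the second argument of re.sub() is a replacement template, not a pattern, so regex tokens such as \d are rejected there with "bad escape \d". One possible alternative, sketched here and not part of the rule above, is to keep the question's modular sub-patterns, wrap the part to keep in a capture group, and use \1 as the replacement. The function name drop_current_year and its year parameter are hypothetical, and \d+ is assumed in place of \d* for the intermediate number.

```python
import datetime
import re

# Context sub-patterns copied from the question.
MONTH_CONTEXT = r"[\s|]*(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|del|de[\s|]*el|de)[\s|]*"
YEAR_CONTEXT = r"[\s|]*(?:del[\s|]*año|de[\s|]*el[\s|]*año|de[\s|]*año|del[\s|]*ano|de[\s|]*el[\s|]*ano|de[\s|]*ano|del|de[\s|]*el|de)[\s|]*"


def drop_current_year(text, year=None):
    """Hypothetical helper: remove a trailing '<context> <year>' that follows another number."""
    if year is None:
        year = datetime.datetime.today().strftime("%Y")
    # Group 1 is everything that must survive; the rest of the match is discarded.
    pattern = (
        r"(\[\d{2} -- \d{2}\]" + MONTH_CONTEXT + r"\d{2}" + YEAR_CONTEXT + r"\d+)"
        + YEAR_CONTEXT + re.escape(year)
    )
    # r"\1" is a valid replacement template: it re-inserts group 1.
    return re.sub(pattern, r"\1", text)


print(repr(drop_current_year('[26 -- 31] de 10 del 200 de 2022', year="2022")))
# '[26 -- 31] de 10 del 200'
print(repr(drop_current_year('[16 -- 06] del mes 09 del 2022', year="2022")))
# '[16 -- 06] del mes 09 del 2022'
```

Passing the year explicitly keeps the examples reproducible; leaving it out falls back to the current year, as in the question.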