Remove the current year if another year was previously indicated after this regex pattern

Time:10-17

Here is my code, with some example inputs that simulate the environment where this program will run:

import re, datetime

#Example input cases
input_text = '[26 -- 31] de 10 del 200 de 2022' #example 1
input_text = '[26 -- 31] de 12 del 206 del 2022' #example 2
input_text = '[06 -- 11] del 09 del ano 2020 del 2022' #example 3
input_text = '[06 -- 06] del mes 09 del ano 20 del ano 2022' #example 4
input_text = '[16 -- 06] del mes 09 del 2022' #example 5 (must not be modified)


possible_year_num = r"\d+" #I need one or more numeric digits (never zero digits)

current_year = datetime.datetime.today().strftime('%Y')

month_context_regex = r"[\s|]*(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|del|de[\s|]*el|de)[\s|]*"
year_context_regex = r"[\s|]*(?:del[\s|]*año|de[\s|]*el[\s|]*año|de[\s|]*año|del[\s|]*ano|de[\s|]*el[\s|]*ano|de[\s|]*ano|del|de[\s|]*el|de)[\s|]*"

#I combine these modular regex pieces to build a regex that identifies the substring in which the replacement must be performed
identity_replacement_case_regex = r"\[\d{2}" + " -- " + r"\d{2}]" + month_context_regex + r"\d{2}" + year_context_regex + possible_year_num + year_context_regex + current_year

#Only in these cases do I need to replace with re.sub() and obtain this output string, e.g. for example 1, '[26 -- 31] de 10 del 200'
replacement_without_current_year = r"\[\d{2}" + " -- " + r"\d{2}]" + month_context_regex + r"\d{2}" + year_context_regex + possible_year_num
input_text = re.sub(identity_replacement_case_regex, replacement_without_current_year, input_text)

print(repr(input_text))  # --> output

The correct outputs should look like this:

'[26 -- 31] de 10 del 200' #for example 1
'[26 -- 31] de 12 del 206' #for example 2
'[06 -- 11] del 09 del ano 2020' #for example 3
'[06 -- 06] del mes 09 del ano 20' #for example 4
'[16 -- 06] del mes 09 del 2022' #for example 5 (not modified)

How should I write the replacement passed to re.sub() to get these outputs?

I get this error when I try this replacement:

Traceback (most recent call last):
input_text_substring = re.sub(identity_replacement_case_regex, replacement_without_current_year, input_text_substring)
raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \d at position 2

CodePudding user response:

Rule:

\D+ (?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022

Demo: https://regex101.com/r/NRUEYO/1

Code:

import re

regex = r"\D+ (?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022"

input_text = '[26 -- 31] de 10 del 200 de 2022' #example 1
input_text = '[26 -- 31] de 12 del 206 del 2022' #example 2
input_text = '[06 -- 11] del 09 del ano 2020 del 2022' #example 3
input_text = '[06 -- 06] del mes 09 del ano 20 del ano 2022' #example 4
input_text = '[16 -- 06] del mes 09 del 2022' #example 5 (must not be modified)

replace_text = ""

result = re.sub(regex, replace_text, input_text)

if result:
    print (result)
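
To check the rule against all five inputs from the question in one run, here is a small sketch. The `examples` list and the `results` name are my own additions for illustration. The pattern is the rule from this answer written with `\D+` (one or more non-digits), which is what lets it consume a whole connector such as " del ano" before 2022. The year stays hard-coded, as in the code above.

```python
import re

# Rule from the answer; \D+ consumes the connector words in front of 2022.
regex = r"\D+ (?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022"

# The five inputs from the question (hypothetical list name).
examples = [
    '[26 -- 31] de 10 del 200 de 2022',
    '[26 -- 31] de 12 del 206 del 2022',
    '[06 -- 11] del 09 del ano 2020 del 2022',
    '[06 -- 06] del mes 09 del ano 20 del ano 2022',
    '[16 -- 06] del mes 09 del 2022',  # must stay unchanged
]

# Replace each match with the empty string, i.e. delete it.
results = [re.sub(regex, "", text) for text in examples]

for text, result in zip(examples, results):
    print(repr(text), '-->', repr(result))
```

Only the last input is left untouched, because there "2022" directly follows the two-digit month and a short connector, which is exactly what the lookbehinds forbid.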

Explanation:

  • \D => any non-digit character
  • \d => any digit character
  • \d{2} => exactly two digits
  • \S => any non-whitespace character
  • \S{3} => exactly three non-whitespace characters
  • (?<!A)2022 => negative lookbehind: 2022 must not be immediately preceded by "A"
  • (?<!\D\d{2} \S{3} )2022 => 2022 must not be preceded by a two-digit number followed by a three-character word (e.g. "09 del ")
  • (?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022 => 2022 must not be preceded by a two-digit number followed by a three- or two-character word (e.g. "09 del " or "09 de ")
  • \D+ (?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022 => match all the non-digit characters before 2022 (the connector words) together with 2022 itself, so that re.sub() removes them
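
As for the error in the question: re.sub() treats its second argument as a replacement template, not as a pattern, so a `\d` inside it is an invalid escape, hence `bad escape \d`. One way to keep the modular regexes from the question is to wrap the part you want to keep in a capture group and use `\1` as the replacement. This is only a sketch under some assumptions: the pieces are joined with `+`, `possible_year_num` is `\d+`, `current_year` is fixed to "2022" so the result is reproducible, and the names `pattern` and `remove_redundant_year` are made up for this example.

```python
import re

# Fixed for reproducibility; the question derives it from datetime.datetime.today().
current_year = "2022"

# Building blocks copied from the question (assumption: \d+ so at least one digit is required).
possible_year_num = r"\d+"
month_context_regex = r"[\s|]*(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|del|de[\s|]*el|de)[\s|]*"
year_context_regex = r"[\s|]*(?:del[\s|]*año|de[\s|]*el[\s|]*año|de[\s|]*año|del[\s|]*ano|de[\s|]*el[\s|]*ano|de[\s|]*ano|del|de[\s|]*el|de)[\s|]*"

# Group 1 holds everything to keep; the trailing connector and the
# current year are matched outside the group, so they get dropped.
pattern = (
    r"(\[\d{2} -- \d{2}\]" + month_context_regex + r"\d{2}"
    + year_context_regex + possible_year_num + r")"
    + year_context_regex + re.escape(current_year)
)

def remove_redundant_year(text):
    # r"\1" refers back to group 1, which is valid in a replacement template.
    return re.sub(pattern, r"\1", text)

print(repr(remove_redundant_year('[26 -- 31] de 10 del 200 de 2022')))
print(repr(remove_redundant_year('[16 -- 06] del mes 09 del 2022')))
```

The second input is returned unchanged because there is no earlier year for the pattern to match before the current one.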