import re, datetime
input_text = "Alrededor de las 00:16 am o las 23:30 pm 2022_-_02_-_18 , quizas cerca del 2022_-_02_-_18 llega el avion, pero no (2022_-_02_-_18 20:16 pm) a las (2022_-_02_-_18 00:16 am), de esos hay dos (22)"
print(repr(input_text)) # --> output
input_date_structure = r"(?P<year>\d*)_-_(?P<month>\d{2})_-_(?P<startDay>\d{2})"
identify_only_date_regex_00 = input_date_structure + r"[\s|]*" + r"(\b\d{2}:\d{2}[\s|]*[ap]m)?" #to identify if there is a time after the date
identify_only_date_regex_01 = r"(\b\d{2}:\d{2}[\s|]*[ap]m)?" + r"[\s|]*" + input_date_structure #to identify if there is a time before the date
date_restructuring_structure = r"\g<year>_-_\g<month>_-_\g<startDay>"
restructuring_only_date = lambda x: x.group() if x.group(1) else "(" + fr"{x.expand(date_restructuring_structure)}" + " 00:00 am)"
#do the replace with re.sub() method and the regex patterns instructions
input_text = re.sub(identify_only_date_regex_00, restructuring_only_date, input_text)
input_text = re.sub(identify_only_date_regex_01, restructuring_only_date, input_text)
#print output
print(repr(input_text)) # --> output
The wrong output that I get:
'Alrededor de las 00:16 am o las 23:30 pm 2022_-_02_-_18 , quizas cerca del(2022_-_02_-_18 00:00 am) llega el avion, pero no ((2022_-_02_-_18 00:00 am) 20:16 pm) a las ((2022_-_02_-_18 00:00 am) 00:16 am), de esos hay dos (22)'
The correct output, in which only the dates that are neither preceded nor followed by a time in hh:mm am/pm format (matched by r"(\d{2}:\d{2}[ \s|]*[ap]m)?") are modified:
"Alrededor de las 00:16 am o las 23:30 pm 2022_-_02_-_18 , quizas cerca del (2022_-_02_-_18 00:00 am) llega el avion, pero no (2022_-_02_-_18 20:16 pm) a las (2022_-_02_-_18 00:16 am), de esos hay dos (22)"
I don't understand why it fails. I thought I was constraining the regex correctly with \b and ?.
Should not be replaced:
"sdsdds 2022_-_02_-_18 00:16 am sdsddssd"
Should not be replaced:
"sdsdsd 00:16 am 2022_-_02_-_18 sdsdsd"
Should be replaced:
"sdsdds 2022_-_02_-_18 sdsdsd"
CodePudding user response:
You can merge the two regexps into a single expression of the form (Group1)?(...)(Group5)? and then check inside the lambda whether Group 1 or Group 5 matched. The trailing group is number 5 because the middle part contains three capturing groups, and even though they are named capturing groups, they are still assigned a numeric ID:
import re, datetime
input_text = "Alrededor de las 00:16 am o las 23:30 pm 2022_-_02_-_18 , quizas cerca del 2022_-_02_-_18 llega el avion, pero no (2022_-_02_-_18 20:16 pm) a las (2022_-_02_-_18 00:16 am), de esos hay dos (22)"
input_date_structure = r"(?P<year>\d*)_-_(?P<month>\d{2})_-_(?P<startDay>\d{2})"
identify_only_date_regex = r"(\b\d{2}:\d{2}[\s|]*[ap]m)?[\s|]*" + input_date_structure + r"[\s|]*(\b\d{2}:\d{2}[\s|]*[ap]m)?"
date_restructuring_structure = r"\g<year>_-_\g<month>_-_\g<startDay>"
restructuring_only_date = lambda x: x.group() if x.group(1) or x.group(5) else "(" + x.expand(date_restructuring_structure) + " 00:00 am)"
input_text = re.sub(identify_only_date_regex, restructuring_only_date, input_text)
print(repr(input_text)) # --> output
See the Python demo.
The output is
Alrededor de las 00:16 am o las 23:30 pm 2022_-_02_-_18 , quizas cerca del(2022_-_02_-_18 00:00 am)llega el avion, pero no (2022_-_02_-_18 20:16 pm) a las (2022_-_02_-_18 00:16 am), de esos hay dos (22)
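Note that in the output above the spaces around the replaced date are lost ("del(2022..." and ")llega"), because the optional whitespace is part of the match and the replacement does not put it back. Below is a minimal sketch of one way to keep them, assuming the whitespace is moved inside the optional time groups and those groups are named, so no group counting is needed. The names DATE, TIME, PATTERN and add_midnight_to_bare_dates are hypothetical and not part of the code above:

```python
import re

# Hypothetical names; the date and time sub-patterns are the ones used above.
DATE = r"(?P<year>\d*)_-_(?P<month>\d{2})_-_(?P<startDay>\d{2})"
TIME = r"\b\d{2}:\d{2}[\s|]*[ap]m"

# The separating whitespace lives inside the optional groups, so a bare date
# matches without swallowing the spaces around it.
PATTERN = re.compile(rf"(?P<before>{TIME}[\s|]*)?{DATE}(?P<after>[\s|]*{TIME})?")

def add_midnight_to_bare_dates(text):
    def repl(m):
        # Leave the match untouched when a time precedes or follows the date.
        if m.group("before") or m.group("after"):
            return m.group()
        return "(" + m.expand(r"\g<year>_-_\g<month>_-_\g<startDay>") + " 00:00 am)"
    return PATTERN.sub(repl, text)

print(add_midnight_to_bare_dates("sdsdds 2022_-_02_-_18 sdsdsd"))
# sdsdds (2022_-_02_-_18 00:00 am) sdsdsd
```

Applied to the input_text from the question, this sketch should produce the expected output given there, with the spaces before and after the wrapped date preserved.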