import re
input_text = '2000_-_9_-_01 8:1 am' #example 1
input_text = '(2000_-_1_-_01) 18:1 pm' #example 2
input_text = '(20000_-_12_-_1) (1:1 am)' #example 3
identificate_hours = r"(?:a\s*las|a\s*la|)\s*(\d{1,2}):(\d{1,2})\s*(?:(am)|(pm)|)"
date_format_00 = r"(\d*)_-_(\d{1,2})_-_(\d{1,2})"
identification_re_0 = r"(?:\(|)\s*" + date_format_00 + r"\s*(?:\)|)\s*(?:a\s*las|a\s*la|)\s*(?:\(|)\s*" + identificate_hours + r"\s*(?:\)|)"
input_text = re.sub(identification_re_0,
                    #lambda m: print(m[2]),
                    lambda m: (f"({m[1]}_-_{m[2]}_-_{m[3]}({m[4] or '00'}:{m[5] or '00'} {m[6] or m[7] or 'am'}))"),
                    input_text, flags=re.IGNORECASE)  # flags must be passed by keyword; the 4th positional argument is count
print(repr(input_text)) # --> output
Considering that there are 5 numerical values (year, month, day, hour, minutes), each of which may or may not need a "0" added in front, there are 2^5 = 32 possible combinations. That is too many to cover with 32 different regexes that do or don't add a "0" in front of each value that needs it. For this reason I feel that repeating the regex, changing only the "(\d{1,2})" groups one by one, would not be a good solution for this problem.
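The count above, and why it does not have to be enumerated, can be sketched as follows. This is only an illustration: the names `FIELDS` and `pad2` are hypothetical and not part of the code above. It assumes each captured value is a string of digits, so one padding rule can be applied to every field independently.

```python
from itertools import product

FIELDS = ("year", "month", "day", "hour", "minute")  # hypothetical labels

# Each field either needs a leading zero or does not: 2 ** 5 cases.
cases = list(product((False, True), repeat=len(FIELDS)))
print(len(cases))  # 32

def pad2(value):
    """Left-pad a digit string with zeros to at least two characters."""
    return value.zfill(2)

# One rule covers every case: short values grow, long ones are unchanged.
print(pad2("9"), pad2("12"), pad2("2000"))  # 09 12 2000
```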
I am trying to standardize date-time data that users enter in natural language so that it can then be processed.
Once the dates are obtained in this format, I need the month, day, hour and/or minute values that are left with a single digit to be standardized to 2 digits, by placing a "0" before them to compensate for the missing digit.
In the output, the input date-times should be expressed in this way:
YYYY_-_MM_-_DD(hh:mm am or pm)
'(2000_-_09_-_01(08:01 am))' #for example 1
'(2000_-_01_-_01(18:01 pm))' #for example 2
'(20000_-_12_-_01(01:01 am))' #for example 3
I have used the re.sub() function because the same input_text may contain more than one place where a replacement of this type must be carried out. For example, for the input '2000_-_9_-_01 8:1 am 2000_-_9_-_01 8:1 am' the procedure should be performed 2 times, since the pattern appears twice, giving '(2000_-_09_-_01(08:01 am)) (2000_-_09_-_01(08:01 am))'
CodePudding user response:
I'm not sure I fully understood you, but I would solve it with datetime instead of regex. That doesn't support the year 20000, though. Is that a typo, or are you planning way ahead? :-D
from datetime import datetime
testDates = [
    '2000_-_9_-_01 8:1 am', #example 1
    '(2000_-_1_-_01) 18:1 pm', #example 2
    '(2000_-_12_-_1) (1:1 am)', #example 3
]
for testDate in testDates:
    testDateClean = testDate
    for rm in ('(', ')'):
        testDateClean = testDateClean.replace(rm, '')
    date = datetime.strptime(testDateClean, '%Y_-_%m_-_%d %H:%M %p')
    print(date.strftime('%Y_-_%m_-_%d(%H:%M %p)'))
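To make the year limitation concrete: `datetime` only supports years up to `datetime.MAXYEAR` (9999), and `%Y` expects four digits, so the five-digit year from the question is rejected by `strptime`. A minimal sketch, assuming an English/C locale for the am/pm marker; `try_parse` is a hypothetical helper, not part of the code above:

```python
from datetime import MAXYEAR, datetime

FORMAT = '%Y_-_%m_-_%d %H:%M %p'  # same format string as in the answer

def try_parse(text):
    """Return a datetime, or None when strptime rejects the text."""
    try:
        return datetime.strptime(text, FORMAT)
    except ValueError:
        return None

print(MAXYEAR)                               # 9999
print(try_parse('2000_-_12_-_1 1:1 am'))     # 2000-12-01 01:01:00
print(try_parse('20000_-_12_-_1 1:1 am'))    # None
```

Since the hour is parsed with `%H` (24-hour clock), the am/pm marker has to be present in the text but does not change the parsed hour.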
A regex solution which can handle all provided example strings:
import re
INPUT_DATES = [
    '(2000_-_09_-_01 (08:01 am)) (2001_-_10_-_01 (09:02 am))',
    '(20000_-_1_-_01) 18:1 pm',
    '2000_-_9_-_01 8:1 am',
    '(2000_-_12_-_1) (1:1 am)',
]
REGEX_SPLIT = re.compile(r'\(([\dpam_\- :\(\)]{10,})\) \(([\dpam_\- :\(\)]{10,})\)')
REGEX_DATE = re.compile(r'(?P<year>\d{4,})_-_(?P<month>\d{1,2})_-_(?P<day>\d{1,2}) (?P<hour>\d{1,2}):(?P<minute>\d{1,2}) (?P<apm>[apm]{2})')
for testDates in INPUT_DATES:
    testDates = REGEX_SPLIT.split(testDates)
    for testDate in testDates:
        if len(testDate) < 10:
            continue
        testDateClean = testDate
        for rm in ('(', ')'):
            testDateClean = testDateClean.replace(rm, '')
        date = REGEX_DATE.match(testDateClean).groupdict()
        print(f'parsed out: {date["year"]}_-_{date["month"]:>02}_-_{date["day"]:>02}({date["hour"]:>02}:{date["minute"]:>02} {date["apm"]}), from in: {testDate}')
output:
parsed out: 2000_-_09_-_01(08:01 am), from in: 2000_-_09_-_01 (08:01 am)
parsed out: 2001_-_10_-_01(09:02 am), from in: 2001_-_10_-_01 (09:02 am)
parsed out: 20000_-_01_-_01(18:01 pm), from in: (20000_-_1_-_01) 18:1 pm
parsed out: 2000_-_09_-_01(08:01 am), from in: 2000_-_9_-_01 8:1 am
parsed out: 2000_-_12_-_01(01:01 am), from in: (2000_-_12_-_1) (1:1 am)
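For comparison, the padding could also be done in a single `re.sub` pass with a callback, which replaces every occurrence in the text and keeps five-digit years. This is only a sketch under stated assumptions: the pattern `DATE_TIME` and the helpers `pad_match` and `normalize` are hypothetical names, the pattern targets just the three input shapes from the question, a missing am/pm marker defaults to 'am' as in the question's lambda, and already-normalized text (nested parentheses) is not handled.

```python
import re

# Assumed pattern: optional parentheses around the date and around the time,
# a year of four or more digits, and an optional am/pm marker.
DATE_TIME = re.compile(
    r"(?:\(\s*)?(?P<year>\d{4,})_-_(?P<month>\d{1,2})_-_(?P<day>\d{1,2})"
    r"\s*\)?\s*\(?\s*"
    r"(?P<hour>\d{1,2}):(?P<minute>\d{1,2})"
    r"(?:\s*(?P<apm>am|pm)\b)?"
    r"(?:\s*\))?",
    re.IGNORECASE,
)

def pad_match(m):
    """Rebuild one matched date-time with two-digit fields."""
    apm = (m["apm"] or "am").lower()  # default taken from the question's code
    return (f"({m['year']}_-_{m['month'].zfill(2)}_-_{m['day'].zfill(2)}"
            f"({m['hour'].zfill(2)}:{m['minute'].zfill(2)} {apm}))")

def normalize(text):
    """Apply pad_match to every date-time occurrence in text."""
    return DATE_TIME.sub(pad_match, text)

print(normalize('2000_-_9_-_01 8:1 am'))       # (2000_-_09_-_01(08:01 am))
print(normalize('(2000_-_1_-_01) 18:1 pm'))    # (2000_-_01_-_01(18:01 pm))
print(normalize('(20000_-_12_-_1) (1:1 am)'))  # (20000_-_12_-_01(01:01 am))
```

Because `str.zfill(2)` pads each captured group on its own, no separate pattern is needed per combination of short and long fields.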