How to extract this data from a string using regex.group()?

Time:09-29

import re   #library for using regular expressions

#input examples:
input_text = "6 de la manana hdhd" #example 1
input_text = "hdhhd 06: de la manana hdhd" #example 2
input_text = "hd:00 06 : de la manana hdhd" #example 3
input_text = "hdhhd 6 de la manana hdhd" #example 4
input_text = "hdhhd 06:00 de la manana hdhd" #example 5
input_text = "hdhhd 06 : 18 de la manana hdhd" #example 6
input_text = "hdhhd 18 de la manana hdhd" #example 7
input_text = "hdhhd 18:18 de la manana hdhd" #example 8
input_text = "hdhhd 18 : 00 de la manana hdhd" #example 9
input_text = "hdhhd 19 : 19 de la noche hdhd" #example 10
input_text = "hdhhd 6  de la noche hdhd" #example 11

This is my code. I have managed to put together the structure of the replacements, but I have not yet been able to extract the data I need for the process, and that is what I need to fix (I have mixed pseudocode into the parts I could not work out):

am_list = ["manana", "mañana", "mediodia", "medio dia","madrugada","amanecer"]
pm_list = ["atardecer", "tarde", "ocaso", "noche", "anochecer"]

is_am_time,is_pm_time = False, False
hour_number_fixed, civil_time_fixed = "", ""
 
re_pattern_for_am = r"\d{1,2})[\s|:]*(\d{0,2})\s*(?:de la |de el)" + am_list
if(identification condition for am):
    #extract with re.group()
    hour_number = int()  # <--- \d{1,2}
    am_or_pm = str()     # <--- am_list

re_pattern_for_pm = r"\d{1,2})[\s|:]*(\d{0,2})\s*(?:de la |de el)" + pm_list
if(identification condition for pm):
    #extract with re.group()
    hour_number = int()  # <--- \d{1,2}
    am_or_pm = str()     # <--- pm_list

if(am_or_pm == one element in am_list): is_am_time = True
elif(am_or_pm == one element in pm_list): is_pm_time = True


if(is_am_time == True):
    if ( hour_number >= 12 ): civil_time_fixed = "pm"
    else: civil_time_fixed = "am"
    hour_number_fixed = str(hour_number)

elif(is_pm_time == True):
    if ( hour_number < 12 ):  hour_number_fixed = str(hour_number + 12 )
    civil_time_fixed = "pm"

#replacement process
input_text = input_text.replace(hour_number, hour_number_fixed, 1)
input_text = input_text.replace(am_or_pm, civil_time_fixed, 1)

print(repr(input_text))

I need the program to find the times matched by "\d{1,2})[\s|:]*(\d{0,2})\s*" and correct them, using the data (hour_number and am_or_pm) that it must extract from input_text with re.group().

The correct output in each case:

"6 am hdhd"                   #for the example 1
"hdhhd 06: am hdhd"           #for the example 2
"hd:00 06 : am hdhd"          #for the example 3
"hdhhd 6 am hdhd"             #for the example 4
"hdhhd 06:00 am hdhd"         #for the example 5
"hdhhd 06 : 18 am hdhd"       #for the example 6
"hdhhd 18 pm hdhd"            #for the example 7
"hdhhd 18:18 pm hdhd"         #for the example 8
"hdhhd 18 : 00 pm hdhd"       #for the example 9
"hdhhd 19 : 19 pm hdhd"       #for the example 10
"hdhhd 18 pm hdhd"            #for the example 11

How do I do those data extractions with re.group() (or similar method) in this code?

CodePudding user response:

It seems imprudent to attempt a full solution, so here is an example of how to extract the hour using named groups with a simplified regex.

import re

input_text = "hdhhd 06:00 de la manana hdhd"
match = re.search(r"(?P<hour>\d\d?):(?P<minutes>\d\d)", input_text)
hour = match.group('hour')
print(hour)    # 06
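One caveat with the simplified regex: re.search returns None when nothing matches, which happens for inputs without a colon such as example 1. A minimal sketch of a guarded version follows; the helper name extract_hour is only illustrative and not part of the question.

```python
import re

def extract_hour(text):
    """Return the hour as a string, or None if no HH:MM time is present."""
    match = re.search(r"(?P<hour>\d\d?):(?P<minutes>\d\d)", text)
    # re.search returns None on failure, so check before calling .group()
    return match.group("hour") if match else None

print(extract_hour("hdhhd 06:00 de la manana hdhd"))  # 06
print(extract_hour("6 de la manana hdhd"))            # None
```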

Other than that what are the specific aspects of your problem that you are struggling with?

CodePudding user response:

First, note that normalizing the hour is beyond the capabilities of regular expressions, so that will need to be performed in Python. Fortunately, re.sub accepts a function to create the replacement string.
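As a small illustration of passing a function to re.sub, the sketch below rewrites a 12-hour "pm" hour as a 24-hour one. The function name to_24h and the sample sentence are made up for this example; only the re.sub call signature is standard.

```python
import re

def to_24h(match):
    # The function receives a match object and returns the replacement text.
    hour = int(match.group("hour"))
    if hour < 12:
        hour += 12
    return f"{hour} pm"

result = re.sub(r"(?P<hour>\d{1,2}) pm", to_24h, "meet at 6 pm or 11 pm")
print(result)  # meet at 18 pm or 23 pm
```

The arithmetic happens in Python, while the regex only locates the text and exposes the pieces through groups.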

Regex

The sample regex has a few large issues:

  • The group to capture the time is missing an opening parenthesis to start the group.
  • You can't add a string and a list; the lists must be joined with a separator.
  • The AM and PM word patterns can't simply be appended to the main patterns; they each must be in a group so they can use alternation.

There's also a minor issue: the pattern will fail for strings with 'de el' because there's no space between 'el' and the AM or PM word.

Note you can combine the two regexes into one, and then check whether the AM or PM subpattern was matched. An easy way to do this is to use two named groups, one for AM and one for PM, with the words and phrases for each period in the corresponding group.

The sub-expressions to match the hour and minute can also be named, for clarity of access. The time expression could also be named.
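To illustrate the idea of checking which named group took part in the match, here is a reduced sketch with only a few period words. The helper describe and the shortened word lists are assumptions made for brevity. A named group that did not participate in the match yields None from match.group.

```python
import re

pattern = re.compile(
    r"(?P<hour>\d{1,2}) de la (?:(?P<am>manana|madrugada)|(?P<pm>tarde|noche))"
)

def describe(text):
    """Return (hour, 'am' or 'pm', matched word), or None if nothing matches."""
    match = pattern.search(text)
    if match is None:
        return None
    # Exactly one of the two alternatives matched; the other group is None.
    period = "am" if match.group("am") is not None else "pm"
    return int(match.group("hour")), period, match.group(period)

print(describe("6 de la noche"))      # (6, 'pm', 'noche')
print(describe("7 de la madrugada"))  # (7, 'am', 'madrugada')
```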

This gives the following Python to create the pattern:

am_pattern = '|'.join(am_list)
pm_pattern = '|'.join(pm_list)
time_pattern = r"(?P<time>(?P<hour>\d{1,2})(?P<minute>[\s:|]*\d{0,2}))"
pattern = rf'{time_pattern}\s*(?:de la|de el)\s(?:(?P<am>{am_pattern})|(?P<pm>{pm_pattern}))'

Evaluated (in free-spacing mode, for clarity), the regex is:

(?P<time>
  (?P<hour>\d{1,2})
  (?P<minute>[\s:|]*\d{0,2})
)
\s*(?:de la|de el)\s
(?:
  (?P<am>manana|mañana|mediodia|medio dia|madrugada|amanecer)
|
  (?P<pm>atardecer|tarde|ocaso|noche|anochecer)
)

There are a few minor improvements that can be made, such as:

  • Anchoring the beginning of the time sub-pattern at a word boundary to prevent a match when there are more than two digits (the existing pattern will match '123: 45', as the \d{1,2} will match the '23' of '123').
  • The time sub-pattern will match any string of 3 or 4 digits, as the separator isn't required. Instead, require the separator and make the minute sub-pattern optional.
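Both points can be checked in isolation. The sketch below uses cut-down patterns (named loose and strict here purely for illustration) that end at "de la" rather than the full AM/PM alternation.

```python
import re

# Unanchored: \d{1,2} may start in the middle of a longer run of digits.
unanchored = re.search(r"\d{1,2}:", "123:45")
# Anchored: \b only lets the match start where the digit run begins.
anchored = re.search(r"\b\d{1,2}:", "123:45")
print(unanchored.group())  # 23:
print(anchored)            # None

# Optional separator (original form) versus required separator.
loose = re.compile(r"(?P<hour>\d{1,2})[\s:|]*(?P<minute>\d{0,2})\s*de la")
strict = re.compile(r"(?P<hour>\b\d{1,2})(?P<minute>[\s:|]+\d{0,2})?\s*de la")

text = "123 de la manana"
loose_match = loose.search(text)
print(loose_match.group("hour", "minute"))  # ('12', '3')
print(strict.search(text))                  # None

valid = strict.search("06:18 de la manana")
print(valid.group("hour", "minute"))        # ('06', ':18')
```

Note that with the stricter form the minute group includes the separator, and it is None when the group does not take part in the match.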

With these changes, the regex construction becomes:

am_pattern = '|'.join(am_list)
pm_pattern = '|'.join(pm_list)
time_pattern = r"(?P<time>(?P<hour>\b\d{1,2})(?P<minute>[\s:|]+\d{0,2})?)"
pattern = rf'{time_pattern}\s*de (?:la|el)\s(?:(?P<am>{am_pattern})|(?P<pm>{pm_pattern}))'

Evaluated:

(?P<time>
  (?P<hour>\b\d{1,2})
  (?P<minute>[\s:|]+\d{0,2})?
)
\s*de (?:la|el)\s
(?:
  (?P<am>manana|mañana|mediodia|medio dia|madrugada|amanecer)
|
  (?P<pm>atardecer|tarde|ocaso|noche|anochecer)
)

Python

With the above regex to extract the necessary information, the replacement function has a few tasks:

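  • check whether an AM or PM word was matched, and use the corresponding replacement
  • correct the meridiem: an hour in the afternoon range is PM even when an AM word follows it
  • correct the hour: add 12 to a 12-hour-clock hour that is followed by a PM word
  • trim trailing whitespace from the time before appending the meridiem

def matched_group(match, groups, default='', throw=False):
    """
    Return the name of the first named group from 'groups' that had a match.
    """
    for group in groups:
        if match.group(group):
            return group
    if throw:
        raise KeyError(f'no group found from ({groups})')
    return default # could also throw

def repl_time(match):
    meridiem = matched_group(match, ['am', 'pm'])
    time, hour, minute = match.group('time', 'hour', 'minute')
    hour = int(hour)
    if hour >= 12:
        # same threshold as the question's code: 12 or more is always PM
        meridiem = 'pm'
    elif 'pm' == meridiem: # hour < 12
        hour += 12
        # 'minute' holds the separator plus digits; it is None if the group didn't match
        time = str(hour) + (minute or '')
    return time.rstrip() + ' ' + meridiem

# compile the pattern built above
reTime = re.compile(pattern)
reTime.sub(repl_time, input_text)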

Applying the above to the samples produces the desired results:

samples = [
    "6 de la manana hdhd",
    "hdhhd 06: de la manana hdhd",
    "hd:00 06 : de la manana hdhd",
    "hdhhd 6 de la manana hdhd",
    "hdhhd 06:00 de la manana hdhd",
    "hdhhd 06 : 18 de la manana hdhd",
    "hdhhd 18 de la manana hdhd",
    "hdhhd 18:18 de la manana hdhd",
    "hdhhd 18 : 00 de la manana hdhd",
    "hdhhd 19 : 19 de la noche hdhd",
    "hdhhd 6  de la noche hdhd",
    ]

[reTime.sub(repl_time, sample) for sample in samples]

Results:

[
    '6 am hdhd',
    'hdhhd 06: am hdhd',
    'hd:00 06 : am hdhd',
    'hdhhd 6 am hdhd',
    'hdhhd 06:00 am hdhd',
    'hdhhd 06 : 18 am hdhd',
    'hdhhd 18 pm hdhd',
    'hdhhd 18:18 pm hdhd',
    'hdhhd 18 : 00 pm hdhd',
    'hdhhd 19 : 19 pm hdhd',
    'hdhhd 18 pm hdhd'
]
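For anyone who wants to run everything in one go, the following self-contained sketch assembles the pieces and asserts each sample against its expected output. The names re_time and cases are illustrative. The threshold of 12 for forcing "pm" follows the rule in the question's code and is an assumption, since none of the samples uses an hour of exactly 12.

```python
import re

am_list = ["manana", "mañana", "mediodia", "medio dia", "madrugada", "amanecer"]
pm_list = ["atardecer", "tarde", "ocaso", "noche", "anochecer"]

am_pattern = "|".join(am_list)
pm_pattern = "|".join(pm_list)
time_pattern = r"(?P<time>(?P<hour>\b\d{1,2})(?P<minute>[\s:|]+\d{0,2})?)"
re_time = re.compile(
    rf"{time_pattern}\s*de (?:la|el)\s(?:(?P<am>{am_pattern})|(?P<pm>{pm_pattern}))"
)

def repl_time(match):
    # Exactly one of the 'am' / 'pm' groups takes part in a match.
    meridiem = "am" if match.group("am") else "pm"
    hour = int(match.group("hour"))
    time = match.group("time")
    if hour >= 12:
        meridiem = "pm"  # an hour of 12 or more is treated as PM
    elif meridiem == "pm":
        # shift a 12-hour-clock hour to 24-hour form, keeping separator and minutes
        time = str(hour + 12) + (match.group("minute") or "")
    return time.rstrip() + " " + meridiem

cases = {
    "6 de la manana hdhd": "6 am hdhd",
    "hdhhd 06: de la manana hdhd": "hdhhd 06: am hdhd",
    "hd:00 06 : de la manana hdhd": "hd:00 06 : am hdhd",
    "hdhhd 6 de la manana hdhd": "hdhhd 6 am hdhd",
    "hdhhd 06:00 de la manana hdhd": "hdhhd 06:00 am hdhd",
    "hdhhd 06 : 18 de la manana hdhd": "hdhhd 06 : 18 am hdhd",
    "hdhhd 18 de la manana hdhd": "hdhhd 18 pm hdhd",
    "hdhhd 18:18 de la manana hdhd": "hdhhd 18:18 pm hdhd",
    "hdhhd 18 : 00 de la manana hdhd": "hdhhd 18 : 00 pm hdhd",
    "hdhhd 19 : 19 de la noche hdhd": "hdhhd 19 : 19 pm hdhd",
    "hdhhd 6  de la noche hdhd": "hdhhd 18 pm hdhd",
}

results = {text: re_time.sub(repl_time, text) for text in cases}
for text, expected in cases.items():
    assert results[text] == expected, (text, results[text])
print("all samples match")
```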