How to set capturing groups to extract and replace with re.sub()

Time:09-04

import re


#input_text_substring = "a partir de las 04:00 am 2022-09-02 hasta a las 04:15 pm 2022-09-04"
input_text_substring = "a partir de las 04:00 am 2022-09-02 fuimos a caminar hasta las 10 montanas, de alli hasta a las 04:15 pm 2022-09-04"


time_in_numbers = r"(\d{2})[\s|]*(?::|)[\s|]*(\d{2})[\s|]*(?:am|pm)"
date_in_numbers = r"\d{4}[\s|]*-[\s|]*\d{2}[\s|]*-[\s|]*\d{2}"
some_text = r"(.*?)" #any substring or no character (the condition will be set by the rest of the large regex)

regexp1 = r"(?:desde|apartir de|a partir de)[\s|]*(?:a esa de|a eso de|a|)[\s|]*(?:las|la|)[\s|]*" + time_in_numbers + r"[\s|]*(?:de|)[\s|]*" + date_in_numbers + r"[\s|]*" + some_text + r"[\s|]*hasta[\s|]*(?:a esa de|a eso de|a|)[\s|]*(?:las|la|)[\s|]*" + time_in_numbers + r"[\s|]*(?:de|)[\s|]*" + date_in_numbers

#Here you should place the capture groups obtained from the previous pattern
replacement1 = r"[(\2 \1)to(\5 \4)][\3]" #I need fix that!!

input_text_substring = re.sub(regexp1, replacement1, input_text_substring)


print(repr(input_text_substring))

The output I need has the format '[(XXXX-XX-XX XX:XX (am|pm))to(XXXX-XX-XX XX:XX (am|pm))][some_text]', where X is any digit. For this input it should be:

'[(2022-09-02 04:00 am)to(2022-09-04 04:15 pm)][fuimos a caminar hasta las 10 montanas, de alli]'

The problem is that the string is printed unchanged: either the regex pattern does not match this input, or the replacements with re.sub() are never applied.
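One quick way to see which backreferences a pattern can offer is to count its capturing groups. The sketch below is only a diagnostic aid, not part of the original code; it reuses the two sub-patterns from the question, and the names time_group_count, date_group_count and captured are illustrative.

```python
import re

# Sub-patterns copied from the question above.
time_in_numbers = r"(\d{2})[\s|]*(?::|)[\s|]*(\d{2})[\s|]*(?:am|pm)"
date_in_numbers = r"\d{4}[\s|]*-[\s|]*\d{2}[\s|]*-[\s|]*\d{2}"

# A compiled pattern exposes its number of capturing groups as .groups.
time_group_count = re.compile(time_in_numbers).groups
date_group_count = re.compile(date_in_numbers).groups
print(time_group_count, date_group_count)

# Only hour and minute are captured; am/pm sits in a non-capturing (?:...) group.
captured = re.search(time_in_numbers, "04:00 am").groups()
print(captured)
```

Anything that is not inside a capturing group (here the date and the am/pm marker) cannot be referenced as \1, \2, ... in the replacement string.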

CodePudding user response:

I didn't check whether this pattern could be shortened or made more efficient, but a few small changes were enough to get it working (at least for this example):

import re

#input_text_substring = "a partir de las 04:00 am 2022-09-02 hasta a las 04:15 pm 2022-09-04"
input_text_substring = "a partir de las 04:00 am 2022-09-02 fuimos a caminar hasta las 10 montanas, de alli hasta a las 04:15 pm 2022-09-04"


time_in_numbers = r"(\d{2}[\s|]*(?::|)[\s|]*\d{2})[\s|]*(am|pm)"
date_in_numbers = r"(\d{4}[\s|]*-[\s|]*\d{2}[\s|]*-[\s|]*\d{2})"
some_text = r"(.*?)" #any substring or no character (the condition will be set by the rest of the large regex)

regexp1 = r"(?:desde|apartir de|a partir de)[\s|]*(?:a esa de|a eso de|a|)[\s|]*(?:las|la|)[\s|]*" + time_in_numbers + r"[\s|]*(?:de|)[\s|]*" + date_in_numbers + r"[\s|]*" + some_text + r"[\s|]*hasta[\s|]*(?:a esa de|a eso de|a|)[\s|]*(?:las|la|)[\s|]*" + time_in_numbers + r"[\s|]*(?:de|)[\s|]*" + date_in_numbers

replacement1 = r"[(\3 \1 \2)to(\7 \5 \6)][\4]" 

input_text_substring = re.sub(regexp1, replacement1, input_text_substring)


print(repr(input_text_substring))

Output:

'[(2022-09-02 04:00 am)to(2022-09-04 04:15 pm)][fuimos a caminar hasta las 10 montanas, de alli]'

Check out the pattern at Regex101
The changes I made:

  1. Surround date_in_numbers with () to make it its own capturing group.
  2. Make (am|pm) a capturing group by removing the ?: from (?:...).
  3. In time_in_numbers, the two digits before and after the colon were separate capturing groups; merge them into a single capturing group covering the whole time.
  4. Adjust the group numbers in replacement1 to match.
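To check which number each group ends up with after these changes, one option is to run re.search() and list the groups next to their backreference numbers. This is a minimal sketch using the revised pattern from this answer; the names text, match and numbered_groups are illustrative.

```python
import re

text = ("a partir de las 04:00 am 2022-09-02 fuimos a caminar hasta las 10 montanas, "
        "de alli hasta a las 04:15 pm 2022-09-04")

# Revised sub-patterns from the answer: time, am/pm and date are each captured.
time_in_numbers = r"(\d{2}[\s|]*(?::|)[\s|]*\d{2})[\s|]*(am|pm)"
date_in_numbers = r"(\d{4}[\s|]*-[\s|]*\d{2}[\s|]*-[\s|]*\d{2})"
some_text = r"(.*?)"

regexp1 = (
    r"(?:desde|apartir de|a partir de)[\s|]*(?:a esa de|a eso de|a|)[\s|]*(?:las|la|)[\s|]*"
    + time_in_numbers + r"[\s|]*(?:de|)[\s|]*" + date_in_numbers
    + r"[\s|]*" + some_text
    + r"[\s|]*hasta[\s|]*(?:a esa de|a eso de|a|)[\s|]*(?:las|la|)[\s|]*"
    + time_in_numbers + r"[\s|]*(?:de|)[\s|]*" + date_in_numbers
)

match = re.search(regexp1, text)

# Groups are numbered from 1 in the order their opening parentheses appear.
numbered_groups = dict(enumerate(match.groups(), start=1))
for number, value in numbered_groups.items():
    print(f"\\{number} -> {value!r}")
```

The listing shows why replacement1 uses \3 \1 \2 for the first date and time and \7 \5 \6 for the second, with \4 holding the text in between.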