Regex of replacements conditioned by previous regex patterns fails to capture any of the strings

Time:01-31

import re

input_text = "En esta alejada ciudad por la tarde circulan muchos camiones con aquellos acoplados rojos, grandes y bastante pesados, llevándolos por esos trayectos bastante empedrados, polvorientos, y un tanto arenosos. Y incluso bastante desde lejos ya se les puede ver." #example string

list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "llevándoles", "llevándole", "llevándolos", "llevándolo", "circularías", "circularía", "circulando", "circulan", "circula", "consiste", "consistían", "consistía", "consistió", "visualizar", "ver", "empolvarle", "empolvar", "verías", "vería", "vieron", "vió", "vio", "ver", "podrías" , "podría", "puede"]

exclude = rf"(?!\b(?:{'|'.join(list_verbs_in_this_input)})\b)"
direct_subject_modifiers, noun_pattern = exclude + r"\w+", exclude + r"\w+"

#modifier_connectors = r"(?:(?:,\s*|)y|(?:,\s*|)y|,)\s*(?:(?:(?:a[úu]n|todav[íi]a|incluso)\s+|)(?:de\s*gran|bastante|un\s*tanto|un\s*poco|)\s*(?:m[áa]s|menos)\s+|)"
modifier_connectors = r"(?:(?:,\s*|)y|(?:,\s*|)y|,)\s*(?:(?:(?:a[úu]n|todav[íi]a|incluso)\s+|)(?:(?:de\s*gran|bastante|un\s*tanto|un\s*poco|)\s*(?:m[áa]s|menos)|bastante)\s+|)"

enumeration_of_noun_modifiers = direct_subject_modifiers + "(?:" + modifier_connectors + direct_subject_modifiers + "){2,}"

sentence_capture_pattern = r"(?:aquellas|aquellos|aquella|aquel|los|las|el|la|esos|esas|este|ese|otros|otras|otro|otra)\s+" + noun_pattern + r"\s+" + enumeration_of_noun_modifiers


input_text = re.sub(sentence_capture_pattern, r"((NOUN)\g<0>)", input_text, flags=re.I|re.U)
print(repr(input_text)) # --> output

The goal is to capture the word r"\w+" that comes right before the pattern enumeration_of_noun_modifiers, together with everything that enumeration_of_noun_modifiers matches, wrap all of it in single quotes ' ', and move the determiner to the end, so that the substrings are restructured like this:

((NOUN='acoplados rojos, grandes y bastante pesados')aquellos)

((NOUN='trayectos bastante empedrados, polvorientos, y un tanto arenosos')esos)

Keep in mind that I have placed exclude in front of r"\w+" in both direct_subject_modifiers and noun_pattern. It checks that the word about to be captured does not match any element of list_verbs_in_this_input (in order to avoid false positives).
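The effect of the exclude lookahead can be checked in isolation. The snippet below is a minimal sketch with a shortened, hypothetical verb list (verbs, text, anchored and unanchored are illustrative names, not from the post). It also shows that the lookahead only blocks a verb when the match starts at a word boundary; in the question's pattern the noun always follows \s+, so the unanchored case is included only to make that role visible.

```python
import re

# Shortened, hypothetical verb list used only for this illustration.
verbs = ["circulan", "ver", "puede"]

# Same construction as in the question: a negative lookahead that fails
# when a whole listed verb starts at the current position.
exclude = rf"(?!\b(?:{'|'.join(verbs)})\b)"

text = "circulan muchos camiones verdes"

# Anchored at a word boundary: "circulan" is skipped entirely, while
# "verdes" is kept because "ver" is not followed by a word boundary.
anchored = re.findall(r"\b" + exclude + r"\w+", text)

# Unanchored: the engine may start inside "circulan", where the
# lookahead no longer sees a word boundary, and match its tail.
unanchored = re.findall(exclude + r"\w+", text)

print(anchored)
print(unanchored)
```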

After identifying and restructuring those substrings, the output string should be the following:

'En esta alejada ciudad por la tarde circulan muchos camiones con ((NOUN='acoplados rojos, grandes y bastante pesados')aquellos), llevándolos por ((NOUN='trayectos bastante empedrados, polvorientos, y un tanto arenosos')esos). Y incluso bastante desde lejos ya se les puede ver.'

What is preventing these substrings from being identified, and why doesn't my regex sentence_capture_pattern work?


EDIT CODE:

This is an edited version of the code after some modifications; even so, it still has some bugs.

import re

input_text = "En esta alejada ciudad por la tarde circulan muchos camiones con aquellos acoplados rojos, grandes y bastante pesados, llevándolos por esos trayectos bastante empedrados, polvorientos, y un tanto arenosos. Y incluso bastante desde lejos ya se les puede ver." #example string

list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "llevándoles", "llevándole", "llevándolos", "llevándolo", "circularías", "circularía", "circulando", "circulan", "circula", "consiste", "consistían", "consistía", "consistió", "visualizar", "ver", "empolvarle", "empolvar", "verías", "vería", "vieron", "vió", "vio", "ver", "podrías" , "podría", "puede"]

exclude = rf"(?!\b(?:{'|'.join(list_verbs_in_this_input)})\b)"
direct_subject_modifiers, noun_pattern = exclude + r"\w+", exclude + r"\w+"

#allows the word "bastante" on its own, independently of whether "(m[áa]s|menos)" follows it
modifier_connectors = r"(?:(?:,\s*|)y|(?:,\s*|)y|,)\s*(?:(?:(?:a[úu]n|todav[íi]a|incluso)\s+|)(?:(?:de\s*gran|bastante|un\s*tanto|un\s*poco|)\s*(?:m[áa]s|menos)|bastante)\s+|)"

#enumeration_of_noun_modifiers = direct_subject_modifiers + "(?:" + modifier_connectors + direct_subject_modifiers + "){2,}"
enumeration_of_noun_modifiers = direct_subject_modifiers + "(?:" + modifier_connectors + direct_subject_modifiers + ")*"


#sentence_capture_pattern = r"(?:aquellas|aquellos|aquella|aquel|los|las|el|la|esos|esas|este|ese|otros|otras|otro|otra)\s+" + noun_pattern + r"\s+" + enumeration_of_noun_modifiers
sentence_capture_pattern = r"(?:aquellas|aquellos|aquella|aquel|los|las|el|la|esos|esas|este|ese|otros|otras|otro|otra)\s+" + noun_pattern + r"\s+" + modifier_connectors + direct_subject_modifiers + r"\s+(?:" + enumeration_of_noun_modifiers + r"|)"

# ((NOUN)'    ')
input_text = re.sub(sentence_capture_pattern, r"((NOUN)'\g<0>')", input_text, flags=re.I|re.U)
print(repr(input_text)) # --> output

CodePudding user response:

It is likely that sentence_capture_pattern is not capturing the correct substrings because it does not account for the elements that tie several words into the same group: the connector words (such as "y", "o", "e") and the phrases that modify the adjectives, such as "un tanto" and "bastante".

It is important to consider these elements in the pattern so that it can correctly identify the substrings that need to be restructured.
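One way to see where the connectors fall short is to test modifier_connectors on its own against the pieces of text that follow each modifier in the input. The sketch below uses the connector pattern from the question (with its "+" quantifiers); connector_prefix is a hypothetical helper introduced here for illustration. It suggests three gaps: whitespace before "y" is not accepted, "un tanto" is only consumed when "más" or "menos" follows it, and nothing handles an intensifier that comes directly after the noun without a connector (as in "trayectos bastante empedrados").

```python
import re

# Connector pattern from the question, split over lines for readability.
modifier_connectors = (
    r"(?:(?:,\s*|)y|(?:,\s*|)y|,)\s*"
    r"(?:(?:(?:a[úu]n|todav[íi]a|incluso)\s+|)"
    r"(?:(?:de\s*gran|bastante|un\s*tanto|un\s*poco|)\s*(?:m[áa]s|menos)|bastante)\s+|)"
)


def connector_prefix(tail):
    """Return the text the connector consumes at the start of tail, or None."""
    m = re.match(modifier_connectors, tail, flags=re.I)
    return m.group(0) if m else None


# Each tail is what follows a modifier somewhere in the example sentence.
for tail in [", grandes", " y bastante pesados",
             "y bastante pesados", ", y un tanto arenosos"]:
    print(repr(tail), "->", repr(connector_prefix(tail)))
```

In the last case the connector stops after ", y ", so the next modifier word would be "un", and " tanto" is not a valid connector after it.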

enumeration_of_noun_modifiers = direct_subject_modifiers + "(?:" + modifier_connectors + direct_subject_modifiers + "){2,}"

to

enumeration_of_noun_modifiers = direct_subject_modifiers + "(?:" + modifier_connectors + direct_subject_modifiers + ")*"

The change is in enumeration_of_noun_modifiers, where the quantifier "{2,}" is replaced by "*" so that the group accepts any number of connector-plus-modifier repetitions instead of requiring at least two. The quantifier has to be replaced rather than appended: adding "*" after "{2,}" makes re raise a "multiple repeat" error.
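The effect of the quantifier on the repeated group can be checked with a reduced pattern. This is a sketch only: item and connector are simplified stand-ins for direct_subject_modifiers and modifier_connectors, and the other names are illustrative.

```python
import re

item = r"\w+"
connector = r",\s*"  # simplified stand-in for modifier_connectors

# One item followed by at least two (connector + item) repetitions,
# i.e. three or more items in total.
at_least_three = item + "(?:" + connector + item + "){2,}"
# One item followed by any number of repetitions, including none.
any_number = item + "(?:" + connector + item + ")*"

two_items = "rojos, grandes"
three_items = "rojos, grandes, pesados"

results = {
    ("{2,}", 2): bool(re.fullmatch(at_least_three, two_items)),
    ("{2,}", 3): bool(re.fullmatch(at_least_three, three_items)),
    ("*", 2): bool(re.fullmatch(any_number, two_items)),
    ("*", 3): bool(re.fullmatch(any_number, three_items)),
}
print(results)

# Stacking "*" directly after "{2,}" is not a valid pattern in Python's re.
try:
    re.compile(at_least_three + "*")
    stacked_is_error = False
except re.error:
    stacked_is_error = True
print(stacked_is_error)
```

Note that "*" also lets the group match zero times, so a determiner, a noun and a single following word are enough for a match; whether that is acceptable depends on how many false positives can be tolerated.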

import re

input_text = "En esta alejada ciudad por la tarde circulan muchos camiones con aquellos acoplados rojos, grandes y bastante pesados, llevándolos por esos trayectos bastante empedrados, polvorientos, y un tanto arenosos. Y incluso bastante desde lejos ya se les puede ver." #example string

list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "llevándoles", "llevándole", "llevándolos", "llevándolo", "circularías", "circularía", "circulando", "circulan", "circula", "consiste", "consistían", "consistía", "consistió", "visualizar", "ver", "empolvarle", "empolvar", "verías", "vería", "vieron", "vió", "vio", "ver", "podrías" , "podría", "puede"]

exclude = rf"(?!\b(?:{'|'.join(list_verbs_in_this_input)})\b)"
direct_subject_modifiers, noun_pattern = exclude + r"\w+", exclude + r"\w+"

modifier_connectors = r"(?:(?:,\s*|)y|(?:,\s*|)y|,)\s*(?:(?:(?:a[úu]n|todav[íi]a|incluso)\s+|)(?:de\s*gran|bastante|un\s*tanto|un\s*poco|)\s*(?:m[áa]s|menos)\s+|)"

enumeration_of_noun_modifiers = direct_subject_modifiers + "(?:" + modifier_connectors + direct_subject_modifiers + ")*"

sentence_capture_pattern = r"(?:aquellas|aquellos|aquella|aquel|los|las|el|la|esos|esas|este|ese|otros|otras|otro|otra)\s+" + noun_pattern + r"\s+" + enumeration_of_noun_modifiers

input_text = re.sub(sentence_capture_pattern, r"((NOUN)\g<0>)", input_text, flags=re.I|re.U)
print(repr(input_text))

#The target output from the question is the following. Note that the replacement template r"((NOUN)\g<0>)" wraps the whole match, so by itself it neither moves the determiner to the end nor adds the quotes:

'En esta alejada ciudad por la tarde circulan muchos camiones con ((NOUN='acoplados rojos, grandes y bastante pesados')aquellos), llevándolos por ((NOUN='trayectos bastante empedrados, polvorientos, y un tanto arenosos')esos). Y incluso bastante desde lejos ya se les puede ver.'
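Below is a minimal sketch of one possible way to reach the target string from the question. It is not the original poster's pattern. The names word, determiner, intensifier, connector, body and the named groups det and body are assumptions introduced here. Compared with the code above, it adds word boundaries around the determiner (so that "los" inside "llevándolos" cannot match), allows whitespace before "y", treats "bastante", "un tanto" and "un poco" as optional intensifiers in front of any modifier, requires at least one connector, and uses named groups in the replacement so the determiner can be moved to the end. The "más"/"menos" and "aún"/"todavía"/"incluso" cases are left out for brevity.

```python
import re

input_text = ("En esta alejada ciudad por la tarde circulan muchos camiones con "
              "aquellos acoplados rojos, grandes y bastante pesados, llevándolos "
              "por esos trayectos bastante empedrados, polvorientos, y un tanto "
              "arenosos. Y incluso bastante desde lejos ya se les puede ver.")

list_verbs_in_this_input = [
    "serías", "serían", "sería", "ser", "es", "llevándoles", "llevándole",
    "llevándolos", "llevándolo", "circularías", "circularía", "circulando",
    "circulan", "circula", "consiste", "consistían", "consistía", "consistió",
    "visualizar", "ver", "empolvarle", "empolvar", "verías", "vería",
    "vieron", "vió", "vio", "podrías", "podría", "puede",
]

# Negative lookahead from the question: the next word must not be a listed verb.
exclude = rf"(?!\b(?:{'|'.join(list_verbs_in_this_input)})\b)"
word = exclude + r"\w+\b"

# Hypothetical building blocks (names are not from the original post).
determiner = (r"\b(?P<det>aquellas|aquellos|aquella|aquel|los|las|el|la|esos"
              r"|esas|este|ese|otros|otras|otro|otra)\b")
intensifier = r"(?:(?:bastante|un\s+tanto|un\s+poco)\s+)?"
connector = r"(?:\s*,\s*y\s+|\s+y\s+|\s*,\s*)"  # ", y " | " y " | ", "

# Noun, first modifier, then one or more "connector + modifier" repetitions.
body = (r"(?P<body>" + word + r"\s+" + intensifier + word
        + r"(?:" + connector + intensifier + word + r")+)")

sentence_capture_pattern = determiner + r"\s+" + body

# Named groups let the replacement reorder the pieces.
result = re.sub(sentence_capture_pattern, r"((NOUN='\g<body>')\g<det>)",
                input_text, flags=re.I)
print(repr(result))
```

On this example sentence the result should equal the target string given in the question. The sketch has only been reasoned through for this one input, so other sentences may need further connectors, intensifiers, or verbs in the exclusion list.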