Match words using this regex pattern, only if these words do not appear within a list of substrings

import re

input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa" #example

list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]

noun_pattern = r"((?:\w+))" # pattern that doesn't tolerate whitespace in the middle

input_text = re.sub(r"(?:^|\s+)a\s+" + noun_pattern, 
                    "\(\g<0>\)", 
                    input_text, re.IGNORECASE)

print(repr(input_text)) # --> output

I need the regex to identify and replace a substring containing no whitespace, "((?:\w+))", when it is at the beginning of the line or preceded by "a", "(?:^|\s+)a\s+", but only if "((?:\w+))" does not match any of the strings in the list list_verbs_in_this_input or a dot ".". The exclusion should use a regex pattern similar to this: re.compile(r"(?:" + rf"({'|'.join(list_verbs_in_this_input)})" + r"|[.;\n]|$)", flags = re.IGNORECASE)

And the correct output should look like this:

'(áshgdhSdah) saasas a corrEr, assasass a saltó sasass (sdssaa)'

Note that the substrings "a corrEr" and "a saltó" were not modified, since they contain words that are in the list_verbs_in_this_input list.

CodePudding user response：

To exclude some words, you can use a negative lookahead assertion at the start of the word you're about to match.
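As a minimal sketch of that idea, the snippet below uses a shortened, hypothetical word list (`forbidden` is an illustrative name, not part of the question). It skips a word only when a forbidden word matches in full, so "sereno" is kept even though it starts with "ser".

```python
import re

# Hypothetical mini word list, for illustration only.
forbidden = ["ser", "correr"]

# (?!...) rejects the position if a forbidden word starts here
# and ends at a word boundary; otherwise \w+ matches the word.
pattern = re.compile(
    rf"\b(?!(?:{'|'.join(forbidden)})\b)\w+",
    flags=re.IGNORECASE,
)

words = pattern.findall("ser sereno corrEr casa")
print(words)  # ['sereno', 'casa']
```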

A few things to correct:

re.sub takes flags as its fifth argument, not its fourth (the fourth is count), so pass it by keyword: flags=re.IGNORECASE.
"\(" is not an escape sequence, and parentheses have no special meaning in the replacement string, so do not escape them: write r"(\g<0>)", using a raw string so that \g reaches re.sub unchanged.
r"(?:^|\s+)a\s+" always requires the "a" to be there. From your description I understood that the "a" is optional when the word is at the start of a line, so use r"(?:\ba\s|^)\s*" instead.
In the regex that matches the forbidden words, require that the word ends right after the match by adding \b to the pattern; otherwise a verb such as "ser" would also block longer words that merely start with it.
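The first point is easy to miss because no error is raised. This small sketch (the sample string is made up for illustration) shows what happens when the flag is passed positionally: it is interpreted as count, so matching stays case-sensitive and the number of replacements is capped.

```python
import re

text = "a A a A a"

# Positional: re.IGNORECASE lands in the `count` parameter.
# Its integer value is 2, so at most two replacements are made
# and the match is still case-sensitive.
wrong = re.sub(r"a", "x", text, re.IGNORECASE)

# Keyword: the flag is applied as intended.
right = re.sub(r"a", "x", text, flags=re.IGNORECASE)

print(repr(wrong))  # 'x A x A a'
print(repr(right))  # 'x x x x x'
```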

Here is what you could do:

import re

input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa" #example

list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]

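noun_pattern = r"\w+"
# Negative lookahead: fail if the next word is one of the listed verbs
exclude = rf"(?!\b(?:{'|'.join(list_verbs_in_this_input)})\b)"
# Either the article "a" followed by whitespace, or the start of the string
article = r"(?:\ba\s|^)\s*"
regex = article + exclude + noun_pattern

# Raw replacement string so that \g<0> is passed to re.sub unchanged
input_text = re.sub(regex, r"(\g<0>)", input_text, flags=re.I|re.U)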

print(repr(input_text))
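Note that \g<0> refers to the whole match, so the article "a" ends up inside the parentheses, whereas the expected output in the question drops it. If that exact output is the goal, one possible variant (a sketch, not necessarily what the asker intended) captures only the word and substitutes that group. The shortened verb list and the names `verbs` and `result` are assumptions made for brevity.

```python
import re

input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa"

# Shortened, illustrative verb list; the full list from the question works the same way.
verbs = ["ser", "es", "correr", "saltó"]

# Reject the word if it is exactly one of the verbs.
exclude = rf"(?!(?:{'|'.join(map(re.escape, verbs))})\b)"

# The article (or start of string) is matched but not captured;
# only the word itself goes into group 1.
regex = rf"(?:\ba\s+|^\s*){exclude}(\w+)"

result = re.sub(regex, r"(\1)", input_text, flags=re.IGNORECASE)
print(repr(result))
# '(áshgdhSdah) saasas a corrEr, assasass a saltó sasass (sdssaa)'
```

Because the article is part of the match but not of the replacement, it is removed from the output along with the whitespace that follows it.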