import re
input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa" #example
list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]
noun_pattern = r"((?:\w+))" # pattern that doesn't tolerate whitespace in the middle
input_text = re.sub(r"(?:^|\s+)a\s+" + noun_pattern,
"\(\g<0>\)",
input_text, re.IGNORECASE)
print(repr(input_text)) # --> output
I need the regex to identify and replace a substring containing no whitespace, "((?:\w+))", when it is at the beginning of the line or preceded by "a", "(?:^|\s+)a\s+", but only if "((?:\w+))" does not match any of the strings in the list list_verbs_in_this_input, or a dot ".", using a regex pattern similar to this:
re.compile(r"(?:" + rf"({'|'.join(list_verbs_in_this_input)})" + r"|[.;\n]|$)", flags = re.IGNORECASE)
And the correct output should look like this:
'(áshgdhSdah) saasas a corrEr, assasass a saltó sasass (sdssaa)'
Note that the substrings "a corrEr" and "a saltó" were not modified, since they contain words that are in the list_verbs_in_this_input list.
Answer:
To exclude some words, you can use a negative lookahead assertion at the start of the word you're about to match.
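As a minimal sketch of that idea (the sample string and the two-word `forbidden` list are made up for illustration, not taken from the question), a negative lookahead placed right before `\w+` rejects a word only when the whole word is one of the forbidden ones, provided the alternation is followed by `\b`:

```python
import re

# Hypothetical sample data, chosen only to show the effect of the lookahead.
words = "ser serio ver verde"
forbidden = ["ser", "ver"]
alternation = "|".join(map(re.escape, forbidden))

# With a trailing \b the lookahead rejects only whole forbidden words.
with_boundary = re.findall(rf"\b(?!(?:{alternation})\b)\w+", words)

# Without it, any word that merely starts with a forbidden word is rejected too.
without_boundary = re.findall(rf"\b(?!(?:{alternation}))\w+", words)

print(with_boundary)     # ['serio', 'verde']
print(without_boundary)  # []
```

The second result shows why the word boundary matters: "serio" and "verde" start with "ser" and "ver", so a lookahead without `\b` filters them out as well.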
A few things to correct:
- re.sub takes the flags as its 5th argument, not its 4th (the 4th is count), so it is safest to pass them by keyword: flags=re.IGNORECASE.
- "\(" is not an escape sequence, so you should just use "(\g<0>)" without "escaping" the parentheses; they have no special meaning in that string. Preferably write it as a raw string, r"(\g<0>)", since "\g" is not a string escape either.
- r"(?:^|\s+)a\s+" will always require the a to be there. From your description I understood that the a could be optional when the word is at the start of a line, so use r"(?:\ba\s|^)\s*" instead.
- In the regex that should match the forbidden words, make sure to require that the word ends right after the match, so add \b in the pattern.
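The first point is easy to miss because the call does not fail. A small sketch with a made-up sample string: `re.IGNORECASE` has the integer value 2, so when it is passed as the 4th positional argument it is taken as `count=2` and the match stays case-sensitive.

```python
import re
import warnings

text = "A a a a"  # hypothetical sample string

# Passed positionally, re.IGNORECASE lands in the count slot (count=2).
# Newer Python versions may warn about a positional count, so the warning
# is silenced here to keep the demonstration quiet.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    as_count = re.sub("a", "x", text, re.IGNORECASE)

# Passed by keyword, it is really used as a flag.
as_flags = re.sub("a", "x", text, flags=re.IGNORECASE)

print(as_count)  # A x x a
print(as_flags)  # x x x x
```

In the first call only two lowercase "a" characters are replaced and the uppercase "A" is left alone, which is the silent misbehaviour in the question's code.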
Here is what you could do:
import re
input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa" #example
list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]
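noun_pattern = r"\w+"
# negative lookahead: the next word must not be one of the listed verbs
exclude = rf"(?!\b(?:{'|'.join(list_verbs_in_this_input)})\b)"
# optional article "a" (required unless the word is at the start of the string)
article = r"(?:\ba\s|^)\s*"
regex = article + exclude + noun_pattern
# raw string for the replacement, because "\g" is not a valid string escape
input_text = re.sub(regex, r"(\g<0>)", input_text, flags=re.I|re.U)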
print(repr(input_text))
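Note that `\g<0>` is the whole match, so the replacement above wraps the leading "a " together with the word. If the goal is the exact output quoted in the question, where the "a" is dropped, one possible variant (a sketch, not necessarily what either author intended) is to capture only the word and reference that group in the replacement. The use of `re.escape` is an added precaution, not part of the original code:

```python
import re

input_text = "a áshgdhSdah saasas a corrEr, assasass a saltó sasass a sdssaa"
list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "corré", "corrió", "corría", "correr", "saltó", "salta", "salto", "circularías", "circularía", "circulando", "circula", "consiste", "consistían", "consistía", "consistió", "ladró", "ladrando", "ladra", "visualizar", "ver", "vieron", "vió"]

# Lookahead that rejects the next word if it is one of the listed verbs.
verbs = "|".join(map(re.escape, list_verbs_in_this_input))
exclude = rf"(?!(?:{verbs})\b)"
article = r"(?:\ba\s|^)\s*"

# Capture only the word, so the article is consumed but not re-inserted.
regex = article + exclude + r"(\w+)"
result = re.sub(regex, r"(\1)", input_text, flags=re.IGNORECASE)

print(repr(result))
# '(áshgdhSdah) saasas a corrEr, assasass a saltó sasass (sdssaa)'
```

The only changes from the version with `\g<0>` are the capture group around `\w+` and the `\1` reference in the replacement.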