I want to replace characters in a text conditionally, depending on certain tags. For example, take the following string:
text <- "In Spanish, Brasil is written as <Brazil>, for some reason."
I would like to convert the text that is outside the angle brackets. I currently know how to do the opposite. I can use gsub() to identify that specific string and change some characters using the following command:
gsub("(<.*)z(.*?>)", "\\1s\\2", text)
[1] "In Spanish, Brasil is written as <Brasil>, for some reason."
But what I want to do is to change the text that is outside without affecting the text that is within angle brackets, for example:
gsub("Brasil", "Brazil", text)
[1] "In Spanish, Brazil is written as <Brazil>, for some reason."
Expected result, where only the text outside the angle brackets is changed:
[1] "In Spanish, Brazil is written as <Brazil>, for some reason."
How could I apply the replacement conditionally so that text within the angle brackets is not affected? Would I need to split the string first, based on the presence of angle brackets, apply the replacements, and then merge all the strings? Or could I just make it work with gsub() and a condition?
CodePudding user response:
Using lookarounds works too:
sub("(?<!<)Brasil(?!>)", "Brazil", text, perl = TRUE)
How this works:
(?<!<) - negative lookbehind asserting that the character immediately to the left is not a literal <
Brasil - the literal string Brasil
(?!>) - negative lookahead asserting that the character immediately to the right is not a literal >
Note that if you have a single replacement per string, then sub suffices. If there is more than one replacement to be made, use gsub.
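As a quick check, here is a minimal sketch of the lookaround approach with the lookahead written as (?!>). The second input, tagged_phrase, is a made-up example (not from the question) that shows a limitation: the lookarounds only inspect the characters directly next to Brasil, so an occurrence deeper inside a tag is still replaced.

```r
text <- "In Spanish, Brasil is written as <Brasil>, for some reason."

# (?<!<) rejects a match directly preceded by "<",
# (?!>) rejects a match directly followed by ">".
lookaround_result <- gsub("(?<!<)Brasil(?!>)", "Brazil", text, perl = TRUE)
print(lookaround_result)
# [1] "In Spanish, Brazil is written as <Brasil>, for some reason."

# Hypothetical input: Brasil sits inside a tag but is not adjacent
# to either bracket, so the lookarounds do not protect it.
tagged_phrase <- "Brasil and <the Brasil team>"
lookaround_phrase <- gsub("(?<!<)Brasil(?!>)", "Brazil", tagged_phrase, perl = TRUE)
print(lookaround_phrase)
# [1] "Brazil and <the Brazil team>"
```

So this pattern is probably only suitable when the tag contains nothing but the word itself.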
CodePudding user response:
You need to use a PCRE regex here (mind the perl=TRUE argument):
gsub("<[^<>]*>(*SKIP)(*F)|Brasil", "Brazil", text, perl=TRUE)
Details:
<[^<>]*>(*SKIP)(*F) - <, zero or more chars other than < and >, and then >, and the match is failed at that position and the regex engine starts searching for the next match from the failure position
| - or
Brasil - a fixed char sequence.
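To illustrate the skip-and-fail behaviour, here is a small sketch on a made-up input (multi_tag_text is not from the question) with several occurrences outside the tags and a multi-word tag. It assumes tags are never nested and are always closed.

```r
# Hypothetical input with two tags and two occurrences outside them.
multi_tag_text <- "Brasil, <the Brasil team> and Brasil again, <Brasil>."

# Everything from "<" to the next ">" is matched, then discarded by
# (*SKIP)(*F); only the Brasil alternative can produce a real match.
skip_result <- gsub("<[^<>]*>(*SKIP)(*F)|Brasil", "Brazil",
                    multi_tag_text, perl = TRUE)
print(skip_result)
# [1] "Brazil, <the Brasil team> and Brazil again, <Brasil>."
```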
If you only want to "skip" matching Brasil when it is immediately preceded by < and immediately followed by >, you can use
gsub("(?<!<(?=\\w+>))Brasil", "Brazil", text, perl=TRUE)
Here, (?<!<(?=\w+>)) is a negative lookbehind that fails the match if Brasil is immediately preceded by a < char that is itself immediately followed by one or more word chars and a > char (i.e. if Brasil is enclosed directly between < and > chars).
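The difference between the two patterns shows up when a tag contains more than one word. A short sketch, using a made-up input (contrast_text is an assumption for illustration):

```r
# Hypothetical input: one bare tag and one multi-word tag.
contrast_text <- "Brasil <Brasil> <Brasil team>"

# Narrow pattern: only protects Brasil when the tag is "<" + word chars + ">".
narrow_result <- gsub("(?<!<(?=\\w+>))Brasil", "Brazil", contrast_text, perl = TRUE)
print(narrow_result)
# [1] "Brazil <Brasil> <Brazil team>"

# Skip-and-fail pattern: protects everything between "<" and ">".
broad_result <- gsub("<[^<>]*>(*SKIP)(*F)|Brasil", "Brazil", contrast_text, perl = TRUE)
print(broad_result)
# [1] "Brazil <Brasil> <Brasil team>"
```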
See an R demo (note I replaced Brazil with Brasil inside angle brackets for better visibility):
text <- "In Spanish, Brasil is written as <Brasil>, for some reason."
gsub("<[^<>]*>(*SKIP)(*F)|Brasil", "Brazil", text, perl=TRUE)
# => [1] "In Spanish, Brazil is written as <Brasil>, for some reason."
text <- "In Spanish, Brasil is written as <Brasil>, for some reason."
gsub("(?<!<(?=\\w+>))Brasil", "Brazil", text, perl=TRUE)
# => [1] "In Spanish, Brazil is written as <Brasil>, for some reason."
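The question also asks about splitting the string on the tags, replacing, and merging the pieces back. That is possible in base R without PCRE verbs. The sketch below is one way to do it, assuming a single input string and tags that are never nested. The function name replace_outside_tags is made up for illustration.

```r
# Hypothetical helper: replace a fixed string only outside <...> tags.
replace_outside_tags <- function(x, pattern, replacement) {
  stopifnot(length(x) == 1L)
  tag_matches <- gregexpr("<[^<>]*>", x)
  # The tags themselves, kept untouched.
  tags <- regmatches(x, tag_matches)[[1]]
  # The pieces between the tags (one more piece than there are tags).
  outside <- regmatches(x, tag_matches, invert = TRUE)[[1]]
  outside <- gsub(pattern, replacement, outside, fixed = TRUE)
  # Interleave: outside[1], tags[1], outside[2], tags[2], ...
  pieces <- character(length(outside) + length(tags))
  pieces[2 * seq_along(outside) - 1] <- outside
  pieces[2 * seq_along(tags)] <- tags
  paste0(pieces, collapse = "")
}

text <- "In Spanish, Brasil is written as <Brasil>, for some reason."
split_result <- replace_outside_tags(text, "Brasil", "Brazil")
print(split_result)
# [1] "In Spanish, Brazil is written as <Brasil>, for some reason."
```

This is more code than the single gsub call, but it may be easier to extend when the replacement outside the tags is more than one substitution.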