Conditional replacement of characters in a string pursuant to the use of certain tags

Time:08-13

I want to replace characters in a text conditionally, according to certain tags. For example, take the following string:

text <- "In Spanish, Brasil is written as <Brazil>, for some reason."

I would like to change only the text that is outside the angle brackets. I currently know how to do the opposite: I can use gsub() to target the bracketed string and change some characters in it with the following command:

gsub("(<.*)z(.*?>)", "\\1s\\2", text)
[1] "In Spanish, Brasil is written as <Brasil>, for some reason."

But what I want to do is change the text that is outside the angle brackets without affecting the text within them. A plain gsub() call, for example, replaces every occurrence regardless of the brackets:

gsub("Brasil", "Brazil", text) 
[1] "In Spanish, Brazil is written as <Brazil>, for some reason."

Expected result, where only the text outside the angle brackets is changed:

[1] "In Spanish, Brazil is written as <Brazil>, for some reason."

How could I apply the replacement conditionally so that text within the angle brackets is not affected? Would I need to split the string first, based on the presence of angle brackets, apply the replacements, and then merge all the pieces? Or could I just make it work with gsub() and a condition?

CodePudding user response:

Using lookarounds works too:

sub("(?<!<)Brasil(?!>)", "Brazil", text, perl = TRUE)

How this works:

  • (?<!<) - negative lookbehind to assert that the character immediately to the left must not be a literal <
  • Brasil - the literal string Brasil
  • (?!>) - negative lookahead to assert that the character immediately to the right must not be a literal >

Note that if you have a single replacement per string, then sub suffices. If more than one replacement has to be made, use gsub.
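
To make the sub versus gsub difference concrete, here is a small sketch on a made-up string (x is an assumption for illustration, not taken from the question) with two occurrences outside angle brackets and one inside. The lookahead is written as (?!>), a plain negative lookahead for >.

```r
# Hypothetical input: two occurrences outside angle brackets, one inside.
x <- "Brasil, <Brasil>, Brasil"
pattern <- "(?<!<)Brasil(?!>)"

# sub() replaces only the first match.
first_only <- sub(pattern, "Brazil", x, perl = TRUE)
print(first_only)
# [1] "Brazil, <Brasil>, Brasil"

# gsub() replaces every match; the bracketed copy is still left alone.
all_matches <- gsub(pattern, "Brazil", x, perl = TRUE)
print(all_matches)
# [1] "Brazil, <Brasil>, Brazil"
```

These lookarounds only inspect the characters directly next to Brasil, so an occurrence inside a longer tag such as <in Brasil today> would still be replaced.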

CodePudding user response:

You need to use a PCRE regex here (mind the perl=TRUE argument):

gsub("<[^<>]*>(*SKIP)(*F)|Brasil", "Brazil", text, perl=TRUE)

Details:

  • <[^<>]*>(*SKIP)(*F) - matches <, then zero or more chars other than < and >, and then >; the (*SKIP)(*F) verbs then fail that match and make the regex engine resume searching from the position where the failure occurred, so nothing inside the brackets can be matched
  • | - or
  • Brasil - a fixed char sequence.

If you only want to skip matching Brasil when it is immediately preceded by < and immediately followed by >, you can use

gsub("(?<!<(?=\\w+>))Brasil", "Brazil", text, perl=TRUE)

Here, (?<!<(?=\w+>)) is a negative lookbehind that fails the match if Brasil is immediately preceded by a < char that is itself immediately followed by one or more word chars and a > char (i.e. if Brasil is enclosed directly between < and >).
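
The two patterns differ once a tag holds more than a single word. A minimal sketch of that difference follows, using a made-up string (x is hypothetical) and writing the second pattern with \\w+, i.e. one or more word chars, as the explanation describes.

```r
# Hypothetical input: a bare word, a one-word tag, and a multi-word tag.
x <- "Brasil <Brasil> <in Brasil today>"

# Skip everything between < and >, then replace what is left.
skip_all <- gsub("<[^<>]*>(*SKIP)(*F)|Brasil", "Brazil", x, perl = TRUE)
print(skip_all)
# [1] "Brazil <Brasil> <in Brasil today>"

# Skip only when Brasil directly follows < and word chars run up to >.
skip_exact <- gsub("(?<!<(?=\\w+>))Brasil", "Brazil", x, perl = TRUE)
print(skip_exact)
# [1] "Brazil <Brasil> <in Brazil today>"
```

Which one fits depends on whether a tag can contain anything besides the word itself; the (*SKIP)(*F) version protects the whole bracketed span.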

See an R demo (note I replaced Brazil with Brasil inside angle brackets for better visibility):

text <- "In Spanish, Brasil is written as <Brasil>, for some reason."
gsub("<[^<>]*>(*SKIP)(*F)|Brasil", "Brazil", text, perl=TRUE)
# => [1] "In Spanish, Brazil is written as <Brasil>, for some reason."
text <- "In Spanish, Brasil is written as <Brasil>, for some reason."
gsub("(?<!<(?=\\w+>))Brasil", "Brazil", text, perl=TRUE)
# => [1] "In Spanish, Brazil is written as <Brasil>, for some reason."
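
For comparison, the split-and-merge route raised in the question can also be sketched in base R without PCRE verbs. This is only one possible way to do it, under the assumption that tags never nest; the variable names inside, outside and result are made up for illustration.

```r
text <- "In Spanish, Brasil is written as <Brasil>, for some reason."

# Locate every <...> tag.
m <- gregexpr("<[^<>]*>", text)
inside  <- regmatches(text, m)[[1]]                 # the tags themselves
outside <- regmatches(text, m, invert = TRUE)[[1]]  # the text between tags

# Replace only in the pieces outside the tags.
outside <- gsub("Brasil", "Brazil", outside, fixed = TRUE)

# Interleave outside/inside pieces (outside always has one more element).
result <- paste0(c(rbind(outside, c(inside, ""))), collapse = "")
print(result)
# [1] "In Spanish, Brazil is written as <Brasil>, for some reason."
```

It is longer than the single gsub() call, but each step can be inspected on its own, which may help when the replacement logic outside the tags gets more involved.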