Writing a list of special characters in regex R

Time:10-05

I'm trying to write a list of special characters in a regex (RStudio), but it doesn't work for one or two of them.

My list is: " / \ ? * : [ ] "

For example:

library(tidyverse)

a <- "test:e?xample"

str_replace_all(a, "[/ \\ ? * : [ ]]", "_")

[1] Output : "test_e_xample"

It works!

But it doesn't work with "[":

a <- "test:e[xample"

str_replace_all(a, "[/ \\ ? * : [ ]]", "_")

[1] Output : "test_e[xample"

CodePudding user response:

In base R, a literal "]" inside a character class must come first, immediately after the opening "[".

a <- "test:e?xample"
b <- "test:e[xample"

gsub("[][/\\?*:]", "_", a)
#> [1] "test_e_xample"

gsub("[][/\\?*:]", "_", b)
#> [1] "test_e_xample"

Created on 2022-10-03 with reprex v2.0.2
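
As a minimal sketch of the "] first" rule, assuming the default base R engine (no perl = TRUE): a ] placed right after the opening [ is read as a literal character rather than as the end of the class. The sample strings here are made up for illustration.

```r
# "]" directly after the opening "[" is a literal, so "[]]" matches one "]".
gsub("[]]", "_", "a]b")
#> [1] "a_b"

# With "]" first, "[" can follow as an ordinary character,
# so "[][]" matches either bracket.
gsub("[][]", "_", "a[b]c")
#> [1] "a_b_c"
```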

CodePudding user response:

You are using the stringr package, which uses the ICU regex flavor. In that flavor, the [ and ] characters are special inside bracket expressions and must be escaped:

str_replace_all(a, "[/\\\\?*:\\[\\]]", "_")

Note the doubled escaping of \, [ and ]. In an ICU bracket expression, two backslashes in the regex (four in the R string literal) are needed to match one literal \ character.
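
To see the two levels of escaping side by side, one option is to print the pattern with writeLines(), which shows the string as the regex engine receives it. This is only an illustrative sketch; the variable name pattern is not from the answer.

```r
# The R string literal carries one extra level of escaping
# compared with the regex the engine actually receives.
pattern <- "[/\\\\?*:\\[\\]]"

# writeLines() prints the string contents, i.e. the real regex.
writeLines(pattern)
#> [/\\?*:\[\]]

# The pattern holds 12 characters: four backslashes, not eight.
nchar(pattern)
#> [1] 12
```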

If you plan to use base R equivalents, mind the difference between the two engines, TRE (used with perl=FALSE or with this argument omitted) and PCRE (when perl=TRUE is used):

gsub("[][/\\?*:]", "_", a)
gsub("[][/\\\\?*:]", "_", a, perl=TRUE)

The first, TRE-based gsub has a single backslash in the bracket expression, while the PCRE regex in the second gsub has two (the same as in the ICU regex flavor). A TRE bracket expression does not support escaping special characters, which is why the "smart placement" technique is used (] goes first, so it is read as a literal), and why a single backslash there matches a literal backslash in the string.

See an R demo:

library(stringr)
a <- "test:e[xample\\"
str_replace_all(a, "[/\\\\?*:\\[\\]]", "_") # => [1] "test_e_xample_"
gsub("[][/\\?*:]", "_", a)                  # => [1] "test_e_xample_"
gsub("[][/\\\\?*:]", "_", a, perl=TRUE)     # => [1] "test_e_xample_"
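
If the replacement is needed in several places, the engine-specific patterns could be wrapped in a small helper. The function below is a hypothetical sketch (replace_specials and its arguments are not part of base R or stringr); it simply picks one of the two base R patterns shown above depending on perl.

```r
# Hypothetical helper (an assumption for illustration): replace the characters
# / \ ? * : [ ] using either TRE (the default) or PCRE (perl = TRUE).
replace_specials <- function(x, replacement = "_", perl = FALSE) {
  # TRE: a backslash is literal inside a bracket expression, so one is enough.
  # PCRE: a backslash is an escape character, so it must itself be escaped.
  pattern <- if (perl) "[][/\\\\?*:]" else "[][/\\?*:]"
  gsub(pattern, replacement, x, perl = perl)
}

x <- "test:e[xample\\"
replace_specials(x)               # TRE engine
#> [1] "test_e_xample_"
replace_specials(x, perl = TRUE)  # PCRE engine
#> [1] "test_e_xample_"
```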