I'm struggling to understand why I seem unable to include a shorthand character class such as \\d or \\w inside a user-defined character class between [ and ] (although I have seen cases where such an inclusion can be done). What I want to do in this illustrative example is relocate the currency symbol from the right end of the string to the start of the string:
a_1 <- c("155.88¥","5156.04€","656","1566.1$")
sub("([\\w.]+)([€$¥])", "\\2\\1", a_1) # doesn't work
sub("([\\d.]+)([€$¥])", "\\2\\1", a_1) # doesn't work
sub("([0-9.]+)([€$¥])", "\\2\\1", a_1) # works
Why does only the fully user-defined character class work but not those that involve the shorthand character classes?
Expected result:
[1] "¥155.88" "€5156.04" "656" "$1566.1"
CodePudding user response:
As requested:
Shorthand character classes such as \\d, \\s and \\w come from Perl, so when you use them inside [ and ], make sure to add perl = TRUE to your call.
For example:
sub("([\\w.]+)([€$¥])", "\\2\\1", a_1, perl = TRUE)
More information can be found here:
https://perldoc.perl.org/perlrecharclass
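As a quick check of where the flag matters, here is a minimal sketch using a hypothetical ASCII-only vector x (not from the question). It assumes R's default regex engine, which accepts \\d outside brackets as an extension but does not treat it as "digit" inside [ and ]:

```r
# Hypothetical sample data, ASCII-only to keep the check locale-independent
x <- c("1566.1$", "656")

# Outside brackets, \\d works with the default engine
default_outside <- grepl("\\d+", x)

# Inside brackets, the default engine does not read \\d as "digit",
# so nothing matches and the input comes back unchanged
default_inside <- sub("([\\d.]+)([$])", "\\2\\1", x)

# With perl = TRUE, \\d inside brackets means "digit"
perl_inside <- sub("([\\d.]+)([$])", "\\2\\1", x, perl = TRUE)

print(default_outside)
print(default_inside)
print(perl_inside)
```

So the flag is only needed for the bracketed form; a bare \\d+ should behave the same with or without it.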
CodePudding user response:
This is a feature of R's default regex flavor, which follows POSIX. POSIX does not allow \d inside a character class (a bracket expression); you must instead use [0-9] or [[:digit:]]. According to the documentation:
There are a number of pre-built classes that you can use inside []:
[:punct:]: punctuation.
[:alpha:]: letters.
[:lower:]: lowercase letters.
[:upper:]: uppercase letters.
[:digit:]: digits.
... (and others)
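The classes listed above can stand in for the Perl shorthands inside brackets. A small sketch, using a made-up vector x for illustration; the mapping of \w to [[:alnum:]_] is the one given in R's regex help page:

```r
# Hypothetical sample strings
x <- c("ab_1", "a b", "12.5")

# \w equivalent inside brackets: letters, digits and underscore
only_word_chars <- grepl("^[[:alnum:]_]+$", x)

# \d equivalent combined with a literal dot
only_digits_dot <- grepl("^[[:digit:].]+$", x)

# \s equivalent: any whitespace character
has_space <- grepl("[[:space:]]", x)

print(only_word_chars)  # TRUE FALSE FALSE
print(only_digits_dot)  # FALSE FALSE TRUE
print(has_space)        # FALSE TRUE FALSE
```

Note that the class name includes its own brackets, so [[:digit:].] is one bracket expression containing the class [:digit:] and a dot.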
Consider:
a_1 <- c("155.88¥","5156.04€","656","1566.1$")
sub("([[:digit:].]+)([€$¥])", "\\2\\1", a_1)
[1] "¥155.88" "€5156.04" "656" "$1566.1"
But note that if we run sub in Perl mode, which uses PCRE regex, then \d inside a character class works:
a_1 <- c("155.88¥","5156.04€","656","1566.1$")
sub("([\\d.]+)([€$¥])", "\\2\\1", a_1, perl=TRUE)
[1] "¥155.88" "€5156.04" "656" "$1566.1"
Perhaps the places where you remember seeing \d inside a character class were using Perl (PCRE) mode.
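If you want a pattern that does not depend on the perl argument at all, the POSIX class form is one option, since PCRE also accepts POSIX classes inside brackets. A minimal sketch with a hypothetical ASCII-only vector x and a pattern variable named pattern (both just for illustration):

```r
# Hypothetical sample data
x <- c("1566.1$", "656")

# POSIX class inside the bracket expression: understood by both engines
pattern <- "([[:digit:].]+)([$])"

default_engine <- sub(pattern, "\\2\\1", x)
perl_engine <- sub(pattern, "\\2\\1", x, perl = TRUE)

print(default_engine)
print(perl_engine)
print(identical(default_engine, perl_engine))
```

Both calls should move the trailing symbol to the front and leave "656" untouched, so the same pattern can be shared between code that sets perl = TRUE and code that does not.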