Problem using \\d inside a user-defined character class

Time:12-11

I'm struggling to understand why I seem unable to include a shorthand character class such as \\d or \\w inside a user-defined character class between [ and ] (although I have seen cases where this can be done). What I want to do in this illustrative example is move the currency symbol from the right end of the string to the start of the string:

a_1 <- c("155.88¥","5156.04€","656","1566.1$")

sub("([\\w.]+)([€$¥])", "\\2\\1", a_1)   # doesn't work
sub("([\\d.]+)([€$¥])", "\\2\\1", a_1)   # doesn't work
sub("([0-9.]+)([€$¥])", "\\2\\1", a_1)   # works

Why does only the fully user-defined character class work but not those that involve the shorthand character classes?

Expected result:

[1] "¥155.88"  "€5156.04" "656"      "$1566.1"

CodePudding user response:

As requested:

Shorthand character classes such as \\d, \\s, and \\w come from Perl. R's default regex engine does not expand them inside a bracket expression, so add perl = TRUE when you use them there.

For example:

sub("([\\w.]+)([€$¥])", "\\2\\1", a_1, perl = TRUE)

More information can be found here:

https://perldoc.perl.org/perlrecharclass
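
To make the difference concrete, here is a minimal sketch that tests \\d outside and inside brackets under both engines. It assumes R's default engine (TRE) and uses plain ASCII input; the variable names are only illustrative.

```r
# Outside brackets, \\d is understood by both engines.
outside_default <- grepl("\\d", "a1")                # TRUE
outside_perl    <- grepl("\\d", "a1", perl = TRUE)   # TRUE

# Inside brackets, only the PCRE engine expands \\d to the digits.
# The default engine does not treat it as "any digit" there.
inside_default <- grepl("^[\\d.]+$", "155.88")               # FALSE
inside_perl    <- grepl("^[\\d.]+$", "155.88", perl = TRUE)  # TRUE
```

So the shorthand itself is available by default; it is only the use inside [ ] that needs perl = TRUE.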

CodePudding user response:

This is a feature of R's default regex flavor, which follows POSIX. POSIX does not allow \d inside a character class; you must use [0-9] or [[:digit:]] instead. According to the documentation:

There are a number of pre-built classes that you can use inside []:

[:punct:]: punctuation.
[:alpha:]: letters.
[:lower:]: lowercase letters.
[:upper:]: uppercase letters.
[:digit:]: digits.
... (and others)
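
As a small sketch of how these classes are written in practice: each one goes inside an outer pair of brackets, and several can share the same brackets. The inputs below are made up for illustration.

```r
x <- c("12.5", "abc", "A1!", "")

# Strings made up only of digits and dots
only_numeric <- grepl("^[[:digit:].]+$", x)   # TRUE FALSE FALSE FALSE

# Strip punctuation
no_punct <- gsub("[[:punct:]]", "", "a.b,c!")   # "abc"

# Two classes combined in one bracket expression
lower_only <- gsub("[[:upper:][:digit:]]", "", "aB1cD2")   # "ac"
```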

Consider:

a_1 <- c("155.88¥","5156.04€","656","1566.1$")
sub("([[:digit:].]+)([€$¥])", "\\2\\1", a_1)

[1] "¥155.88" "€5156.04" "656" "$1566.1"

But note that if we run sub in Perl mode, which uses PCRE regex, then \d inside a character class works:

a_1 <- c("155.88¥","5156.04€","656","1566.1$")
sub("([\\d.]+)([€$¥])", "\\2\\1", a_1, perl=TRUE)

[1] "¥155.88" "€5156.04" "656" "$1566.1"

Perhaps the places where you remember seeing \d inside a character class were using Perl (PCRE) mode.
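
If the pattern has to behave the same way regardless of the perl setting, one option is to stick with the POSIX class, which both engines accept inside brackets. The following is only a sketch: move_currency_to_front is a hypothetical helper name, not part of base R, and it assumes a UTF-8 session so that the currency symbols are handled as single characters.

```r
# Hypothetical helper: moves a trailing currency symbol to the front.
# [[:digit:].] is understood by both the default and the PCRE engine.
move_currency_to_front <- function(x, perl = FALSE) {
  sub("([[:digit:].]+)([€$¥])", "\\2\\1", x, perl = perl)
}

a_1 <- c("155.88¥", "5156.04€", "656", "1566.1$")

default_result <- move_currency_to_front(a_1)
pcre_result    <- move_currency_to_front(a_1, perl = TRUE)

# Both engines should agree on the result.
identical(default_result, pcre_result)
```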
