Replacing phone numbers in different formats in R

Time:01-28

I am using a regex that was suggested here to replace any type of phone number with aaaaaaaaaa. This is a snapshot of my data:

library(dplyr)    # %>%, select(), mutate()
library(stringr)  # str_replace_all()

df <- data.frame(
  text = c(
    'my number is (123)-416-567',
    "1 321 124 7889 is valid",
    'why not taking 987-012-6782',
    '120 967 3256 is correct',
    'call at 888 969 9919',
    'please text at 1 647 989 1213'
  )
)

df %>% select(text)

                           text
1    my number is (123)-416-567
2       1 321 124 7889 is valid
3   why not taking 987-012-6782
4       120 967 3256 is correct
5          call at 888 969 9919
6 please text at 1 647 989 1213

My code is

df %>% 
  mutate(
    text = str_replace_all(text, '^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$', 'aaaaaaaaaa')
  )

and I get this error:

Error: '\+' is an unrecognized escape in character string starting "'^(\+"
Error: unexpected ')' in "  )"

The outcome should look like this:

                       text
1   my number is aaaaaaaaaa
2       aaaaaaaaaa is valid
3 why not taking aaaaaaaaaa
4     aaaaaaaaaa is correct
5        call at aaaaaaaaaa
6 please text at aaaaaaaaaa

CodePudding user response:

You can use

str_replace_all(text, '(?:\\+?\\d{1,2}\\s)?\\(?\\d{3}\\)?[\\s.-]\\d{3}[\\s.-]\\d{3,4}(?!\\d)', 'aaaaaaaaaa')

See the regex demo.

Details:

  • (?:\+?\d{1,2}\s)? - an optional sequence of an optional +, then one or two digits, and a whitespace character
  • \(? - an optional (
  • \d{3} - three digits
  • \)? - an optional )
  • [\s.-] - a -, . or whitespace
  • \d{3} - three digits
  • [\s.-] - a -, . or whitespace
  • \d{3,4} - three or four digits
  • (?!\d) - a negative lookahead: no digit is allowed immediately after the match.
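
To check the breakdown above against the sample data, here is a minimal, self-contained sketch. It assumes the stringr package is installed and that the optional prefix is a literal +. The names phone_pattern and masked are illustrative and do not come from the original post.

```r
library(stringr)

# Sample data from the question
df <- data.frame(
  text = c(
    "my number is (123)-416-567",
    "1 321 124 7889 is valid",
    "why not taking 987-012-6782",
    "120 967 3256 is correct",
    "call at 888 969 9919",
    "please text at 1 647 989 1213"
  ),
  stringsAsFactors = FALSE
)

# Every regex backslash is doubled inside an ordinary R string literal.
phone_pattern <- paste0(
  "(?:\\+?\\d{1,2}\\s)?",  # optional country code, e.g. "1 " or "+1 "
  "\\(?\\d{3}\\)?",        # area code, parentheses optional
  "[\\s.-]\\d{3}",         # separator, then three digits
  "[\\s.-]\\d{3,4}",       # separator, then three or four digits
  "(?!\\d)"                # no further digit may follow
)

masked <- str_replace_all(df$text, phone_pattern, "aaaaaaaaaa")
print(masked)
```

Building the pattern with paste0() is only a readability choice; the concatenated string is the same single-line pattern.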

Notes:

  • In an R string literal, a literal backslash must be written as a double backslash (\\), so \d becomes "\\d"
  • ^ and $ match the start and end of the whole string. Here the numbers sit inside longer text, so the ^ anchor is removed and $ is replaced with a right-hand digit boundary, (?!\d)
  • The final \d{4} in the original pattern did not match numbers whose last part has only three digits, such as (123)-416-567, so I replaced it with \d{3,4}.
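
The notes above can be checked with a short base-R sketch. It assumes R 4.0.0 or later for the raw-string syntax r"(...)", and it uses grepl() with perl = TRUE because base R's default regex engine does not support lookaheads. The variable names are illustrative only.

```r
# An ordinary string literal needs doubled backslashes;
# a raw string (R >= 4.0.0) lets the regex be written as-is.
escaped <- "\\d{3}[\\s.-]\\d{3}[\\s.-]\\d{3,4}(?!\\d)"
raw     <- r"(\d{3}[\s.-]\d{3}[\s.-]\d{3,4}(?!\d))"
same_pattern <- identical(escaped, raw)

x <- "call at 888 969 9919"

# With ^ and $ the whole string must be a phone number, so this fails.
anchored <- grepl(r"(^\d{3}[\s.-]\d{3}[\s.-]\d{4}$)", x, perl = TRUE)

# Without the anchors the number is found inside the sentence.
unanchored <- grepl(raw, x, perl = TRUE)

# The lookahead rejects a digit run that continues past four digits.
long_run <- grepl(raw, "888 969 99199", perl = TRUE)

print(c(same_pattern = same_pattern, anchored = anchored,
        unanchored = unanchored, long_run = long_run))
```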