Str_replace Regex in R

I am trying to figure out the best way to remove parentheses and everything inside them using str_detect and regex. I've found answers that help remove parentheses containing text and numbers, but I am having trouble when the contents include a $ and possibly a comma in the value.

I have tried the regex \([^()]*\), which seems to work on https://regex101.com/, but when I run it on the sample code no changes are applied to the column.

Any pointers would be appreciated!

Sample Data:
Example <- data.frame(Column1 = c(
  "Pineapple ($1,000)", 
  "($50,000) Roger", 
  "($1,000)", 
  "First ($100), Second ($1,000)"))

Output <- Example %>%
mutate(Column1 =  str_replace(Column1, "\([^()]*\)", ""))

I managed to get the desired output using gsub, but I am still wondering what the tidyverse approach would be.

Example$Column1 <- gsub("\\([^()]*\\)", "", Example$Column1)

CodePudding user response：

It's not clear to me how you want to deal with entries where you have more than one number. That aside and generally, a more convenient option is to use readr::parse_number, rather than using stringr::str_detect/stringr::str_remove. parse_number takes care of additional text, units and thousands separators.

If you want to keep only the first number (in cases where there is more than one number per entry), you can do

library(tidyverse)
Example %>% mutate(Column1 = parse_number(Column1))
#  Column1
#1    1000
#2   50000
#3    1000
#4     100

Or if you want to keep both/multiple numbers, I suggest using separate_rows to separate entries based on a comma followed by a whitespace, before using readr::parse_number.

Example %>%
    separate_rows(Column1, sep = ",\\s") %>%
    mutate(Column1 = parse_number(Column1))
## A tibble: 5 × 1
#  Column1
#    <dbl>
#1    1000
#2   50000
#3    1000
#4     100
#5    1000
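
If you would rather keep one row per original entry, a hedged alternative is to pull every dollar amount out of each string first and parse them afterwards. This is only a sketch: the pattern "\\$[\\d,]+" and the name amounts are my own assumptions, and it assumes every amount starts with $ and uses commas only as thousands separators.

```r
library(stringr)
library(readr)

Example <- data.frame(Column1 = c(
    "Pineapple ($1,000)",
    "($50,000) Roger",
    "($1,000)",
    "First ($100), Second ($1,000)"))

# Extract every "$..." token per entry (a list with one character vector per row),
# then let parse_number strip the "$" and the thousands separators
amounts <- lapply(
    str_extract_all(Example$Column1, "\\$[\\d,]+"),
    parse_number)

# The last entry keeps both of its numbers
amounts[[4]]
```

The result is a list, so it can be stored as a list-column or unnested later if one row per number is needed.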

Update

To separate keys and values, here is an option; please see inline comments for explanations:

library(tidyverse)
Example %>%
    # Separate multiple comma-separated entries into rows
    separate_rows(Column1, sep = ",\\s") %>%
    # Swap "(value) key" to "key (value)"
    mutate(Column1 = str_replace(
        Column1, "^(\\(.+\\))\\s(\\w+)$", "\\2 \\1")) %>%
    # Separate "key (value)" into columns
    separate(Column1, c("key", "value"), sep = "\\s", fill = "left") %>%
    # Parse number
    mutate(value = parse_number(value))
## A tibble: 5 × 2
#  key       value
#  <chr>     <dbl>
#1 Pineapple  1000
#2 Roger     50000
#3 NA         1000
#4 First       100
#5 Second     1000

Sample data

Example <- data.frame(Column1 = c(
    "Pineapple ($1,000)", 
    "($50,000) Roger", 
    "($1,000)", 
    "First ($100), Second ($1,000)"))

CodePudding user response：

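You can use str_replace_all (str_replace only replaces the first match in each string) and double every backslash in the pattern: inside an R string literal the backslash must itself be escaped, so the regex \( is written as "\\(".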

library(tidyverse)
Example %>% mutate(Column1 =  str_replace_all(Column1, "\\([^()]*\\)", ""))
#          Column1
#1      Pineapple 
#2           Roger
#3                
#4 First , Second
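
To see why the doubled backslash matters, here is a small base R sketch. The variable names are mine, and the raw-string syntax assumes R 4.0.0 or later.

```r
# The regex engine must receive \( and \), i.e. a backslash plus a parenthesis.
# In an ordinary R string literal each backslash has to be escaped.
pattern <- "\\([^()]*\\)"
writeLines(pattern)   # prints the pattern as the regex engine sees it
nchar("\\(")          # 2 characters: a backslash and a parenthesis

# Raw strings (R >= 4.0.0) avoid the doubling
raw_pattern <- r"{\([^()]*\)}"
identical(pattern, raw_pattern)

x <- "First ($100), Second ($1,000)"
first_only <- sub(pattern, "", x)    # first match only, like str_replace
all_matches <- gsub(pattern, "", x)  # every match, like str_replace_all
first_only
all_matches
```

The last two lines also show why the fourth row needs the _all variant: replacing only the first match leaves the second parenthesised amount in place.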

CodePudding user response：

If I've understood your requirements correctly, a very simple base R gsub will sort you out, including tidying up the stray spaces:

gsub(" ?\\([^()]*\\) ?", "", Example$Column1)
[1] "Pineapple"     "Roger"         ""              "First, Second"

I'm not sure what you mean by a "tidyverse approach": these are just regular expressions, they are not specific to a package or even to R. If you prefer to use verbose wrappers you can use stringr::str_replace_all with the same exact pattern.
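
For completeness, a minimal sketch of what that could look like inside a dplyr pipeline. I am using stringr::str_remove_all here, which is shorthand for str_replace_all with an empty replacement; the name Output is simply carried over from the question.

```r
library(dplyr)
library(stringr)

Example <- data.frame(Column1 = c(
    "Pineapple ($1,000)",
    "($50,000) Roger",
    "($1,000)",
    "First ($100), Second ($1,000)"))

# Same pattern as the gsub call: an optional space on either side of the
# parenthesised chunk is removed together with it
Output <- Example %>%
    mutate(Column1 = str_remove_all(Column1, " ?\\([^()]*\\) ?"))

Output$Column1
```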