R - removing substring in column of strings based on pattern and condition

I have a column of strings in a data frame where I would like to replace the values to include only the substring before the first " (", i.e., before the first space/open bracket pair. Not all of the strings contain brackets, and I want those to be left as they are.

Example data:

col1 <- c(1, 2, 3, 4)
col2 <- c("a b (ABC DE)", "bcd", "cd ef (CE)", "bcd")
df <- data.frame(col1, col2)
df

Output:

  col1         col2
1    1 a b (ABC DE)
2    2          bcd
3    3   cd ef (CE)
4    4          bcd

The output I'm looking for would be something like this:

col1 <- c(1, 2, 3, 4)
col2 <- c("a b", "bcd", "cd ef", "bcd")
df <- data.frame(col1, col2)
df

Output:

  col1  col2
1    1   a b
2    2   bcd
3    3 cd ef
4    4   bcd

The actual data frame is 40000 rows with the strings taking many possible values, so it can't be done manually like in the example. I'm not confident at all working with regex/patterns, but accept this may be the most straightforward way to do this.

CodePudding user response：

Here's a dplyr method

library(dplyr)
library(stringr)

df %>% 
  mutate(across(everything(), ~str_replace_all(., "\\(.+?\\)", "")))

Which returns the df (note that the space before each bracket is kept, and col1 is converted to character):

  col1   col2
1    1   a b 
2    2    bcd
3    3 cd ef 
4    4    bcd
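
The across(everything()) call above runs the replacement on every column, including the numeric col1. As a minimal base R sketch (not the answerer's code), the same idea can be limited to character columns. The pattern used here is an assumption: it also swallows any whitespace before the bracket and everything after it.

```r
# Example data from the question; stringsAsFactors = FALSE keeps col2 as
# character on R versions before 4.0 as well.
df <- data.frame(
  col1 = c(1, 2, 3, 4),
  col2 = c("a b (ABC DE)", "bcd", "cd ef (CE)", "bcd"),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns hold character data.
is_chr <- vapply(df, is.character, logical(1))

# Drop everything from the first "(" (plus any whitespace before it) to the
# end of the string, in character columns only.
df[is_chr] <- lapply(df[is_chr], function(x) sub("\\s*\\(.*$", "", x))

str(df)
```

With this approach col1 stays numeric, which matters if it is used in calculations later on.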

CodePudding user response：

For an option that needs almost no regex, use tidyr::separate, splitting on the first " (" and dropping whatever follows:

tidyr::separate(df, col2, into = "col2", sep = " \\(", extra = "drop")

  col1  col2
1    1   a b
2    2   bcd
3    3 cd ef
4    4   bcd
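
If the goal is to avoid patterns altogether, a base R sketch along the following lines may also work. It looks for the literal two-character string " (" with fixed = TRUE, so nothing is interpreted as a regex. The variable names pos and trimmed are illustrative only.

```r
col2 <- c("a b (ABC DE)", "bcd", "cd ef (CE)", "bcd")

# Position of the first literal " (" in each string; -1 where it is absent.
# fixed = TRUE switches off regex interpretation; as.integer() drops the
# extra attributes that regexpr() attaches to its result.
pos <- as.integer(regexpr(" (", col2, fixed = TRUE))

# Keep the part before " (" when it is present, otherwise the whole string.
trimmed <- ifelse(pos > 0, substr(col2, 1, pos - 1), col2)

trimmed
```

For a data frame, the same lines can be applied to df$col2 and the result assigned back to that column.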

CodePudding user response：

I'd rather use a regex than substrings.

transform(df, col2=gsub('\\s+\\(.*', '', col2))
#   col1  col2
# 1    1   a b
# 2    2   bcd
# 3    3 cd ef
# 4    4   bcd

CodePudding user response：

Using R base gsub

> df$col2 <- gsub("\\s*\\(.*\\)", "", df$col2)
> df
  col1  col2
1    1   a b
2    2   bcd
3    3 cd ef
4    4   bcd

CodePudding user response：

A possible solution, based on stringr:

library(tidyverse)

df %>% 
  mutate(col2 = str_remove_all(col2, "\\s*\\(.*\\)\\s*"))

#>   col1  col2
#> 1    1   a b
#> 2    2   bcd
#> 3    3 cd ef
#> 4    4   bcd
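
With 40000 rows of varied strings, it may be worth checking how a bracket-matching pattern behaves when a value has more than one bracket pair or has text after the closing bracket. The sketch below uses a made-up test string (not from the question) to compare three base R variants.

```r
# Hypothetical value with two bracket pairs and trailing text.
x <- "a (b) c (d) e"

# Greedy: a single match runs from the first "(" to the last ")".
greedy <- gsub("\\s*\\(.*\\)", "", x)

# Lazy: each bracket pair is removed separately; text in between survives.
lazy <- gsub("\\s*\\(.*?\\)", "", x, perl = TRUE)

# To end of string: everything from the first " (" onward is dropped, which
# is the literal reading of "the substring before the first ' ('".
to_end <- sub("\\s*\\(.*$", "", x)

c(greedy = greedy, lazy = lazy, to_end = to_end)
```

Here greedy gives "a e", lazy gives "a c e", and to_end gives "a". Which of these is wanted depends on what the real data looks like.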