I have a column of strings in a data frame, and I would like to replace each value with only the substring before the first " (", i.e., before the first space/open-bracket pair. Not all of the strings contain brackets, and I want those to be left as they are.
Example data:
col1 <- c(1, 2, 3, 4)
col2 <- c("a b (ABC DE)", "bcd", "cd ef (CE)", "bcd")
df <- data.frame(col1, col2)
df
Output:
  col1         col2
1    1 a b (ABC DE)
2    2          bcd
3    3   cd ef (CE)
4    4          bcd
The output I'm looking for would be something like this:
col1 <- c(1, 2, 3, 4)
col2 <- c("a b", "bcd", "cd ef", "bcd")
df <- data.frame(col1, col2)
df
Output:
  col1  col2
1    1   a b
2    2   bcd
3    3 cd ef
4    4   bcd
The actual data frame has 40,000 rows, and the strings take many possible values, so it can't be done manually as in the example. I'm not at all confident working with regex/patterns, but I accept that this may be the most straightforward way to do it.
CodePudding user response:
Here's a dplyr method:
library(dplyr)
library(stringr)
df %>%
  # drop the bracketed part together with the whitespace in front of it
  mutate(col2 = str_replace_all(col2, "\\s*\\(.+?\\)", ""))
Which returns the df:
  col1  col2
1    1   a b
2    2   bcd
3    3 cd ef
4    4   bcd
CodePudding user response:
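For an option that needs almost no regex, use tidyr::separate with the literal " (" as the separator (the bracket has to be escaped, and without sep the default would split on the first space instead):
tidyr::separate(df, col2, into = "col2", sep = " \\(", extra = "drop")
  col1  col2
1    1   a b
2    2   bcd
3    3 cd ef
4    4   bcd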
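If the goal is to avoid regular expressions altogether, a minimal base R sketch is to look for the literal " (" with fixed = TRUE and keep whatever comes before it. The names pos and the stringsAsFactors = FALSE argument are additions for illustration (the latter only matters on R versions before 4.0, where strings would otherwise become factors).

```r
col1 <- c(1, 2, 3, 4)
col2 <- c("a b (ABC DE)", "bcd", "cd ef (CE)", "bcd")
df <- data.frame(col1, col2, stringsAsFactors = FALSE)

# Position of the first literal " (" in each string; -1 where it is absent.
# as.integer() drops the extra attributes that regexpr() attaches.
pos <- as.integer(regexpr(" (", df$col2, fixed = TRUE))

# Keep everything before that position; leave strings without " (" as they are
df$col2 <- ifelse(pos > 0, substr(df$col2, 1, pos - 1), df$col2)
df
```

Because the search string is treated literally, no escaping of the bracket is needed here.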
CodePudding user response:
I'd rather use a regex than substrings.
transform(df, col2 = gsub('\\s\\(.*', '', col2))
#   col1  col2
# 1    1   a b
# 2    2   bcd
# 3    3 cd ef
# 4    4   bcd
CodePudding user response:
Using base R gsub:
> df$col2 <- gsub("\\s*\\(.*\\)", "", df$col2)
> df
  col1  col2
1    1   a b
2    2   bcd
3    3 cd ef
4    4   bcd
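One caveat worth checking against the real data: a pattern that ends in "\\)" removes only the bracketed group, so any text after the closing bracket survives, and a bracket without a preceding space is removed too. The sketch below compares that with cutting at the first literal " ("; the vector x and its last two strings are hypothetical test cases, not values from the question.

```r
# Hypothetical test strings; the last two do not appear in the question
x <- c("a b (ABC DE)", "bcd", "cd ef (CE) gh", "f(x) y")

# Cut at the first " (" and drop everything after it
cut_first <- sub(" \\(.*", "", x)

# Remove the bracketed group only; text after the last ")" is kept
strip_group <- gsub("\\s*\\(.*\\)", "", x)

print(cut_first)    # "a b" "bcd" "cd ef" "f(x) y"
print(strip_group)  # "a b" "bcd" "cd ef gh" "f y"
```

Both give the same result on the example data, so the choice only matters if the 40000 rows contain strings shaped like the last two.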
CodePudding user response:
A possible solution, based on stringr:
library(tidyverse)
df %>%
  mutate(col2 = str_remove_all(col2, "\\s*\\(.*\\)\\s*"))
#>   col1  col2
#> 1    1   a b
#> 2    2   bcd
#> 3    3 cd ef
#> 4    4   bcd