How can I remove Roman numerals (I/II/III), parentheses and anything inside them, e.g. (xyz), dashes (-), semicolons (;), and grades (e.g. Grade 21) from the strings in this dataframe?
#Original dataframe
Jobs <- c("Social Worker I (Child Welfare Services), Grade 21", "Engineer I/II/III, Grade 19/22/25", "Legislative Attorney; Grade 32")
df <- data.frame(Jobs)
df
I would like the dataframe to look like this:
#Desired dataframe
Jobs <- c("Social Worker", "Engineer", "Legislative Attorney")
df1 <- data.frame(Jobs)
df1
CodePudding user response:
You can use regular expressions to remove the matching substrings:
library(tidyverse)
Jobs <- c("Social Worker I (Child Welfare Services), Grade 21", "Engineer I/II/III, Grade 19/22/25", "Legislative Attorney; Grade 32")
df <- data.frame(Jobs)
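df %>%
  mutate(Jobs = Jobs %>%
    # Roman numerals, "Grade" followed by digits/slashes, dashes and semicolons
    str_remove_all("I|II|III|Grade [0-9/]+|[-;]") %>%
    # leftover slashes and commas
    str_remove_all("[/,]") %>%
    # parentheses and everything inside them
    str_remove_all("[(][^(]+[)]") %>%
    str_trim())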
#> Jobs
#> 1 Social Worker
#> 2 Engineer
#> 3 Legislative Attorney
Created on 2022-05-05 by the reprex package (v2.0.0)
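One caveat with the pattern above: a bare `I` also matches a capital I inside a word, so a title such as "IT Specialist II" would lose its first letter. The following is only a minimal sketch of a stricter variant, assuming the numerals always stand alone as separate words. The "IT Specialist II" row is a made-up example that is not in the original data, and `cleaned` is just an illustrative variable name.

```r
library(stringr)

# The last element is a hypothetical extra row, added only to show the edge case
Jobs <- c("Social Worker I (Child Welfare Services), Grade 21",
          "Engineer I/II/III, Grade 19/22/25",
          "Legislative Attorney; Grade 32",
          "IT Specialist II, Grade 20")

cleaned <- str_remove_all(Jobs, "\\([^)]*\\)")            # parentheses and their contents
cleaned <- str_remove_all(cleaned, "Grade\\s+[0-9/]+")    # "Grade 21", "Grade 19/22/25"
cleaned <- str_remove_all(cleaned, "\\bI{1,3}\\b")        # I, II, III only as whole words
cleaned <- str_remove_all(cleaned, "[-;,/]")              # leftover punctuation
cleaned <- str_squish(cleaned)                            # trim and collapse whitespace

cleaned
```

The word boundaries (`\\b`) are what keep the "I" in "IT" intact. Note that the last punctuation step would also strip hyphens inside a title, which may or may not be what you want.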
CodePudding user response:
I am not a regex expert, but I hope the following gsub option helps:
> trimws(gsub("(\\b((I+)/?)+\\b)|\\(.*?\\)|[-;,]|(Grade\\s\\S+)", "", Jobs))
[1] "Social Worker" "Engineer" "Legislative Attorney"
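The call in this answer returns a cleaned character vector rather than a data frame. To get the data frame the question asks for, the result can be assigned back to the column. This is a minimal base R sketch, assuming `df` is built as in the question. The variable name `pattern` is illustrative, and `perl = TRUE` is an addition here, used only to make the regex dialect explicit.

```r
Jobs <- c("Social Worker I (Child Welfare Services), Grade 21",
          "Engineer I/II/III, Grade 19/22/25",
          "Legislative Attorney; Grade 32")
df <- data.frame(Jobs)

# Four alternatives: standalone runs of I separated by "/",
# parenthesised text, the characters - ; , and "Grade" plus the token after it
pattern <- "(\\b((I+)/?)+\\b)|\\(.*?\\)|[-;,]|(Grade\\s\\S+)"

# Overwrite the column in place; trimws() drops the leftover trailing spaces
df$Jobs <- trimws(gsub(pattern, "", df$Jobs, perl = TRUE))
df
```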