Cleaning Values in R

Time:05-05

How can I remove Roman numerals (I/II/III), parentheses and anything inside them, e.g. (xyz), dashes (-), semicolons (;), and grades (e.g. Grade 21) from the character values in this dataframe?

#Original dataframe
Jobs <- c("Social Worker I (Child Welfare Services), Grade 21", "Engineer I/II/III, Grade 19/22/25", "Legislative Attorney; Grade 32")
df <- data.frame(Jobs)
df

The dataframe should end up looking like this:

#dataframe
Jobs <- c("Social Worker", "Engineer", "Legislative Attorney")
df1 <- data.frame(Jobs)
df1

CodePudding user response:

You can use regular expressions to remove the matching substrings:

library(tidyverse)

Jobs <- c("Social Worker I (Child Welfare Services), Grade 21", "Engineer I/II/III, Grade 19/22/25", "Legislative Attorney; Grade 32")
df <- data.frame(Jobs)

df %>%
  mutate(Jobs = Jobs %>%
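    # Roman numerals, the grade suffix, dashes and semicolons
    str_remove_all("I|II|III|Grade [0-9/]+|[-;]") %>%
    # leftover slashes and commas
    str_remove_all("[/,]") %>%
    # parentheses and their contents
    str_remove_all("[(][^(]+[)]") %>%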
    str_trim())
#>                   Jobs
#> 1        Social Worker
#> 2             Engineer
#> 3 Legislative Attorney

Created on 2022-05-05 by the reprex package (v2.0.0)
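
One caveat with the pattern above: `I|II|III` removes every capital I, so a title such as "IT Manager I" would lose its first letter. That title is a hypothetical example and is not in the original data. A minimal sketch, assuming stringr is installed, that anchors the Roman numerals with word boundaries instead:

```r
library(stringr)

# "IT Manager I, Grade 20" is a hypothetical title added for illustration
jobs <- c("Engineer I/II/III, Grade 19/22/25", "IT Manager I, Grade 20")

# Drop the grade suffix first, so its slashes and digits go with it
cleaned <- str_remove_all(jobs, "Grade [0-9/]+")
# \\b keeps the match to standalone I, II, III; the "I" in "IT" is untouched
cleaned <- str_remove_all(cleaned, "\\bI{1,3}\\b")
# Parentheses and whatever is inside them
cleaned <- str_remove_all(cleaned, "\\([^)]*\\)")
# Leftover separators: dashes, semicolons, commas, slashes
cleaned <- str_remove_all(cleaned, "[-;,/]")
# Trim the ends and collapse repeated inner whitespace
cleaned <- str_squish(cleaned)

cleaned
# cleaned is c("Engineer", "IT Manager")
```

`str_squish()` is used here in place of `str_trim()` so that any double spaces left behind inside a title are collapsed as well.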

CodePudding user response:

I am not a regex expert, but I hope the following gsub option helps:

> trimws(gsub("(\\b((I+)/?)+\\b)|\\(.*?\\)|[-;,]|(Grade\\s\\S+)", "", Jobs))
[1] "Social Worker"        "Engineer"             "Legislative Attorney"
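
To apply the same idea to the dataframe column rather than the bare vector, the alternation can be assembled from labelled parts, which may make it easier to read and adjust. This is a minimal base R sketch; `clean_jobs` and the part names are hypothetical, introduced here only for illustration:

```r
Jobs <- c("Social Worker I (Child Welfare Services), Grade 21",
          "Engineer I/II/III, Grade 19/22/25",
          "Legislative Attorney; Grade 32")
df <- data.frame(Jobs)

# Hypothetical helper: builds one pattern from labelled pieces
clean_jobs <- function(x) {
  parts <- c(
    roman  = "\\b(I+/?)+\\b",  # I, II, III, optionally joined by slashes
    parens = "\\([^)]*\\)",    # parentheses and their contents
    punct  = "[-;,]",          # dashes, semicolons, commas
    grade  = "Grade\\s\\S+"    # "Grade" plus the token that follows it
  )
  pattern <- paste(parts, collapse = "|")
  # Remove every match, then strip leading and trailing whitespace
  trimws(gsub(pattern, "", x))
}

df$Jobs <- clean_jobs(df$Jobs)
df$Jobs
# df$Jobs is c("Social Worker", "Engineer", "Legislative Attorney")
```

Note that `trimws()` only strips whitespace at the ends; if removals can leave double spaces in the middle of a title, an extra `gsub("\\s+", " ", ...)` pass would be needed.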