Remove all characters except for certain characters and numbers

Time:10-07

I'm working with a string vector, and I want to remove everything except certain letter codes (listed in C_Keep) and any numbers that follow them.

C_Keep <- c("ab", "acr", "sb", "scr")

For example, take the following string in a data frame:

text <- "ab 187; acr 76 98 298 876 987; legislature governors office attorney generals office re gaming issues ab 1416 165 187 267; calepa support sb 265; scr 17689 986 83783 3982"

df <- data.frame(text)

I would like the output to be:

"ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"

It would be great if the output could also be a column in the data frame df, so that the final result looks like the df built here:

text <- "ab 187; acr 76 98 298 876 987; legislature governors office attorney generals office re gaming issues ab 1416 165 187 267; calepa support sb 265; scr 17689 986 83783 3982"

output <- "ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"

df <- data.frame(text, output)
df 

Thank you so much for your help!

CodePudding user response:

Here is a tidyverse (stringr) solution. Please carefully test it on more strings to ensure that it behaves as expected on your data.

Setup:

library(tidyverse)
C_Keep <- c("ab", "acr", "sb", "scr")
text <- "ab 187; acr 76 98 298 876 987; legislature governors office attorney generals office re gaming issues ab 1416 165 187 267; calepa support sb 265; scr 17689 986 83783 3982"

Solution:

# This regex checks for any of the strings in C_Keep
keep_regex = paste(C_Keep, collapse = "|")
text %>%
  # Split into separate strings to work with one at a time
  str_split("; ") %>%
  unlist() %>%
  # Grab only the text following anything in C_Keep
  map_chr(
    str_extract, 
    # The pattern is (keep_regex), followed by one or more digits or spaces,
    # i.e., [0-9 ]+
    pattern = paste0("(", keep_regex, ")", "[0-9 ]+")
  ) %>%
  # Put it all back together into a single string
  paste(collapse = "; ")

#> "ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"
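One case worth testing, as suggested above: if a segment contains none of the C_Keep strings, str_extract() returns NA, and paste() would then write the literal text "NA" into the result. The following is a minimal sketch of one way to drop such segments; the input string is made up for illustration, and the pattern assumes the "one or more digits or spaces" intent described in the comments above.

```r
library(stringr)

C_Keep <- c("ab", "acr", "sb", "scr")
keep_regex <- paste(C_Keep, collapse = "|")
pattern <- paste0("(", keep_regex, ")", "[0-9 ]+")

# Hypothetical input: the middle segment has nothing to keep
segments <- str_split("ab 187; governors office; calepa support sb 265", "; ")[[1]]

# str_extract() is vectorised and returns NA where there is no match
matches <- str_extract(segments, pattern)

# Drop the NA entries before joining the pieces back together
result <- paste(matches[!is.na(matches)], collapse = "; ")
result
#> [1] "ab 187; sb 265"
```

Whether unmatched segments should be dropped or flagged depends on your data, so treat this as one option rather than the required behavior.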

Update: Data frame solution

To perform this action on a set of strings that might be in a data frame, we first need to put our work into a function.

For simplicity, I'm first going to make the function only work with one string at a time. There are other ways to do this, but I don't want to make significant changes from my original work above.

# Let's make the function accept a vector of "keep_strings" as 
# a parameter in case the strings change in the future. I'm
# calling it extract_info_helper because it is only able to work
# with one string at a time. 
extract_info_helper = function(text, keep_strings) {
  # Our function will only work with one string at a time,
  # so let's check that at the start to ensure it is not misused
  stopifnot(length(text) == 1)
  
  # This regex checks for any of the strings in keep_strings
  keep_regex = paste(keep_strings, collapse = "|")
  
  # Output:
  text %>%
    # Split into separate strings to work with one at a time
    str_split("; ") %>%
    unlist() %>%
    # First grab only the text following anything in keep_strings
    map_chr(
      str_extract,
      # The pattern is (keep_regex), followed by one or more digits or spaces,
      # i.e., [0-9 ]+
      pattern = paste0("(", keep_regex, ")", "[0-9 ]+")
    ) %>%
    # Put it all back together into a single string
    paste(collapse = "; ") 
}

Now we will make a function that accepts a vector of texts, and calls extract_info_helper() on each of them. I will use map_chr() to do this (map_chr() helps us call a function on each element of a vector, and the _chr means that the output will be a character vector).

# This one will work with a vector of texts by calling 
# extract_info_helper() on each of them, one at a time.
extract_info = function(texts, keep_string) {
  map_chr(
    texts, 
    extract_info_helper, 
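    # Pass the argument under the helper's exact parameter name (keep_strings)
    # rather than relying on R's partial argument matching
    keep_strings = keep_string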
  )
}
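As one of the "other ways to do this" mentioned earlier, here is a rough sketch that skips the split step and works on a whole vector at once with str_extract_all(). The function name extract_info_vec is hypothetical, and the pattern is a stricter variant I am assuming for illustration: it requires a word boundary before the code and at least one number after it. Unlike the helper above, it keeps every match in a segment, not only the first.

```r
library(stringr)

# Hypothetical vectorised variant: one regex pass per string, no splitting
extract_info_vec <- function(texts, keep_strings) {
  keep_regex <- paste(keep_strings, collapse = "|")
  # A kept code at a word boundary, followed by one or more " <digits>" groups
  pattern <- paste0("\\b(?:", keep_regex, ")(?: [0-9]+)+")
  # str_extract_all() returns a list with one character vector per input string
  matches <- str_extract_all(texts, pattern)
  # Join each string's matches; a string with no matches becomes ""
  vapply(matches, paste, character(1), collapse = "; ")
}

out <- extract_info_vec(
  c("example ab 567", "eg sb 7537 9842; hi ab 384; sb 894257"),
  keep_strings = c("ab", "acr", "sb", "scr")
)
out
#> [1] "ab 567"                          "sb 7537 9842; ab 384; sb 894257"
```

Because this pattern is stricter than the original, results can differ on real data, so compare both approaches on a sample before switching.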

Now we are ready to work within a data frame. I'll make an example data frame so that you can see how it can work.

data = tibble(
  text = c(
    "ab 123; acr 12 34 56", 
    "example ab 567",
    "eg sb 7537 9842; hi ab 384; sb 894257"
  )
)

# A tibble: 3 x 1
  text                                 
  <chr>                                
1 ab 123; acr 12 34 56                 
2 example ab 567                       
3 eg sb 7537 9842; hi ab 384; sb 894257

Now here's how you would call the function:

data %>%
  mutate(output = extract_info(text, keep_string = C_Keep))

# A tibble: 3 x 2
  text                                  output                         
  <chr>                                 <chr>                          
1 ab 123; acr 12 34 56                  ab 123; acr 12 34 56           
2 example ab 567                        ab 567                         
3 eg sb 7537 9842; hi ab 384; sb 894257 sb 7537 9842; ab 384; sb 894257

CodePudding user response:

This works with your example input/output, but there are a number of potential problems that may occur when you apply it to your actual data:

library(tidyverse)

C_Keep <- c("ab", "acr", "sb", "scr")
text <- "ab 187; acr 76 98 298 876 987; legislature governors office attorney generals office re gaming issues ab 1416 165 187 267; calepa support sb 265; scr 17689 986 83783 3982"
output <- "ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"
output
#> [1] "ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"

strings <- str_extract_all(text, "[ ;\\d]*ab[ ;\\d]*|[ ;\\d]*acr[ ;\\d]*|[ ;\\d]*sb[ ;\\d]*|[ ;\\d]*scr[ ;\\d]*",
                           simplify = TRUE) %>%
  str_replace_all("^ ", "")

result <- paste(strings, collapse = "")
result
#> [1] "ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"

all.equal(output, result)
#> [1] TRUE

Created on 2022-10-07 by the reprex package (v2.0.1)
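The long pattern above does not have to be typed by hand. As a sketch, it can be built from C_Keep with paste0(), which yields the same string. The last line shows one of the potential problems mentioned above, using a made-up input: the pattern has no word boundary, so a code is also matched inside a longer word.

```r
library(stringr)

C_Keep <- c("ab", "acr", "sb", "scr")
text <- "ab 187; acr 76 98 298 876 987; legislature governors office attorney generals office re gaming issues ab 1416 165 187 267; calepa support sb 265; scr 17689 986 83783 3982"

# Wrap each code in the same character classes, then join with "|"
pattern <- paste0("[ ;\\d]*", C_Keep, "[ ;\\d]*", collapse = "|")

strings <- str_extract_all(text, pattern)[[1]]
strings <- str_replace_all(strings, "^ ", "")
result <- paste(strings, collapse = "")
result
#> [1] "ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"

# Caveat: "ab" is matched inside the hypothetical word "cab"
str_extract_all("cab 12", pattern)[[1]]
#> [1] "ab 12"
```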
