I'm working with a string vector and I want to remove everything except certain keywords, C_Keep,
and any run of numbers that follows one of these keywords.
C_Keep <- c("ab", "acr", "sb", "scr")
For example, take the following string in a data frame:
text <- "ab 187; acr 76 98 298 876 987; legislature governors office attorney generals office re gaming issues ab 1416 165 187 267; calepa support sb 265; scr 17689 986 83783 3982"
df <- data.frame(text)
I would like the output to be:
"ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"
If it could also be in a column in the data frame df, that would be great, so that the final result looks like the object df below:
text <- "ab 187; acr 76 98 298 876 987; legislature governors office attorney generals office re gaming issues ab 1416 165 187 267; calepa support sb 265; scr 17689 986 83783 3982"
output <- "ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"
df <- data.frame(text, output)
df
Thank you so much for your help!
CodePudding user response:
Here is a tidyverse (stringr) solution. Please carefully test it on more strings to ensure that it behaves as expected on your data.
Setup:
library(tidyverse)
C_Keep <- c("ab", "acr", "sb", "scr")
text <- "ab 187; acr 76 98 298 876 987; legislature governors office attorney generals office re gaming issues ab 1416 165 187 267; calepa support sb 265; scr 17689 986 83783 3982"
Solution:
# This regex matches any of the strings in C_Keep
keep_regex = paste(C_Keep, collapse = "|")
text %>%
  # Split into separate strings to work with one at a time
  str_split("; ") %>%
  unlist() %>%
  # Grab only the keyword from C_Keep and the numbers following it
  map_chr(
    str_extract,
    # The pattern is (keep_regex), followed by one or more
    # digits or spaces, i.e., [0-9 ]+
    pattern = paste0("(", keep_regex, ")", "[0-9 ]+")
  ) %>%
  # Put it all back together into a single string
  paste(collapse = "; ")
#> "ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"
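Two edge cases are worth testing on real data. A segment that contains no keyword makes str_extract() return NA, which paste() then writes out as the literal text "NA". Also, the digits-or-spaces part of the pattern can be satisfied by a single space, so a segment such as "ab initio" would still produce a match. The following is only a sketch under the assumption that a keyword must be a whole word followed by at least one number; strict_regex, pieces, result, and the sample text are made up for illustration.

```r
library(stringr)

C_Keep <- c("ab", "acr", "sb", "scr")

# Assumption: a keyword counts only as a whole word (\\b) that is
# followed by one or more space-separated numbers.
strict_regex <- paste0("\\b(", paste(C_Keep, collapse = "|"), ")( [0-9]+)+")

# Hypothetical input: one keyword-free segment and two look-alikes
text <- "ab 187; no bill here; cab 12; ab initio; sb 265 77"

# One extracted piece per segment; NA where nothing matched
pieces <- str_extract(str_split(text, "; ")[[1]], strict_regex)

# Drop the NA pieces before joining so "NA" never reaches the output
result <- paste(pieces[!is.na(pieces)], collapse = "; ")
result
#> [1] "ab 187; sb 265 77"
```

Whether "cab 12" or "ab initio" should be kept depends on your data, so treat the stricter pattern as one option rather than a correction.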
Update: Data frame solution
To perform this action on a set of strings that might be in a data frame, we first need to put our work into a function.
For simplicity, I'm first going to make the function only work with one string at a time. There are other ways to do this, but I don't want to make significant changes from my original work above.
# Let's make the function accept a vector of "keep_strings" as
# a parameter in case the strings change in the future. I'm
# calling it extract_info_helper because it is only able to work
# with one string at a time.
extract_info_helper = function(text, keep_strings) {
  # Our function will only work with one string at a time,
  # so let's check that at the start to ensure it is not misused
  stopifnot(length(text) == 1)
  # This regex matches any of the strings in keep_strings
  keep_regex = paste(keep_strings, collapse = "|")
  # Output:
  text %>%
    # Split into separate strings to work with one at a time
    str_split("; ") %>%
    unlist() %>%
    # Grab only the keyword from keep_strings and the numbers following it
    map_chr(
      str_extract,
      # The pattern is (keep_regex), followed by one or more
      # digits or spaces, i.e., [0-9 ]+
      pattern = paste0("(", keep_regex, ")", "[0-9 ]+")
    ) %>%
    # Put it all back together into a single string
    paste(collapse = "; ")
}
Now we will make a function that accepts a vector of texts and calls extract_info_helper() on each of them. I will use map_chr() to do this (map_chr() helps us call a function on each element of a vector, and the _chr means that the output will be a character vector).
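# This one will work with a vector of texts by calling
# extract_info_helper() on each of them, one at a time.
extract_info = function(texts, keep_string) {
  map_chr(
    texts,
    extract_info_helper,
    # The helper's argument is named keep_strings; use the full name
    # rather than relying on R's partial argument matching
    keep_strings = keep_string
  )
}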
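To see the map_chr() mechanics in isolation, here is a small sketch with a made-up helper, shout(), that (like extract_info_helper()) only accepts one string at a time. The function name and its suffix argument are hypothetical and exist only for this illustration.

```r
library(purrr)

# Hypothetical single-string helper, standing in for extract_info_helper()
shout <- function(x, suffix) {
  stopifnot(length(x) == 1)
  paste0(toupper(x), suffix)
}

# map_chr() calls shout() once per element; named arguments placed
# after the function are forwarded unchanged to every call
shouted <- map_chr(c("ab", "sb"), shout, suffix = "!")
shouted
#> [1] "AB!" "SB!"
```

Calling shout(c("ab", "sb"), "!") directly would stop with an error because of the length check, which is why the wrapper is needed.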
Now we are ready to work within a data frame. I'll make an example data frame so that you can see how it can work.
data = tibble(
  text = c(
    "ab 123; acr 12 34 56",
    "example ab 567",
    "eg sb 7537 9842; hi ab 384; sb 894257"
  )
)
# A tibble: 3 x 1
  text
  <chr>
1 ab 123; acr 12 34 56
2 example ab 567
3 eg sb 7537 9842; hi ab 384; sb 894257
Now here's how you would call the function:
data %>%
  mutate(output = extract_info(text, keep_string = C_Keep))
# A tibble: 3 x 2
  text                                  output
  <chr>                                 <chr>
1 ab 123; acr 12 34 56                  ab 123; acr 12 34 56
2 example ab 567                        ab 567
3 eg sb 7537 9842; hi ab 384; sb 894257 sb 7537 9842; ab 384; sb 894257
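As mentioned, there are other ways to do this. One possible alternative, sketched here rather than recommended, skips the split-and-helper step: str_extract_all() is already vectorised over its input, so it returns one vector of matches per row, and each vector can then be pasted back together. The example uses a plain data.frame as in the question; the names matches and keep_regex and the use of vapply() are choices made for this sketch.

```r
library(stringr)

C_Keep <- c("ab", "acr", "sb", "scr")
# Keyword followed by one or more digits or spaces
keep_regex <- paste0("(", paste(C_Keep, collapse = "|"), ")[0-9 ]+")

df <- data.frame(text = c(
  "ab 123; acr 12 34 56",
  "example ab 567",
  "eg sb 7537 9842; hi ab 384; sb 894257"
))

# One character vector of matches per row of df
matches <- str_extract_all(df$text, keep_regex)

# Trim stray trailing spaces and join each row's matches with "; "
df$output <- vapply(
  matches,
  function(m) paste(str_trim(m), collapse = "; "),
  character(1)
)
df$output
#> [1] "ab 123; acr 12 34 56"            "ab 567"
#> [3] "sb 7537 9842; ab 384; sb 894257"
```

Unlike the split-based version, this picks up every match in a segment rather than only the first one, and a row without any keyword becomes an empty string instead of "NA". Check which behaviour you want before switching.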
CodePudding user response:
This works with your example input/output, but test it on your actual data before relying on it. For example, the pattern has no word boundaries, so a keyword inside a longer word (the "ab" in "cabinet", say) would also be extracted:
library(tidyverse)
C_Keep <- c("ab", "acr", "sb", "scr")
text <- "ab 187; acr 76 98 298 876 987; legislature governors office attorney generals office re gaming issues ab 1416 165 187 267; calepa support sb 265; scr 17689 986 83783 3982"
output <- "ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"
output
#> [1] "ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"
strings <- str_extract_all(
  text,
  "[ ;\\d]*ab[ ;\\d]*|[ ;\\d]*acr[ ;\\d]*|[ ;\\d]*sb[ ;\\d]*|[ ;\\d]*scr[ ;\\d]*",
  simplify = TRUE
) %>%
  str_replace_all("^ ", "")
result <- paste(strings, collapse = "")
result
#> [1] "ab 187; acr 76 98 298 876 987; ab 1416 165 187 267; sb 265; scr 17689 986 83783 3982"
all.equal(output, result)
#> [1] TRUE
Created on 2022-10-07 by the reprex package (v2.0.1)
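The long alternation above repeats the same padding around each keyword, so it can be generated from C_Keep instead of typed by hand. The sketch below builds it with paste0() and confirms that it is the same string; pad, pattern, and hand_written are names introduced here for illustration only.

```r
library(stringr)

C_Keep <- c("ab", "acr", "sb", "scr")

# Wrap every keyword in the same "[ ;\\d]*" padding, then join with "|"
pad <- "[ ;\\d]*"
pattern <- paste0(pad, C_Keep, pad, collapse = "|")

# The pattern exactly as typed in the answer above
hand_written <- "[ ;\\d]*ab[ ;\\d]*|[ ;\\d]*acr[ ;\\d]*|[ ;\\d]*sb[ ;\\d]*|[ ;\\d]*scr[ ;\\d]*"
identical(pattern, hand_written)
#> [1] TRUE

# The pattern has no word boundaries, so a keyword inside a longer
# word is extracted too
inside_word <- str_extract_all("cabinet", pattern)[[1]]
inside_word
#> [1] "ab"
```

Generating the pattern this way keeps it in sync if the keywords in C_Keep change later.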