I'm trying to remove specific strings from a data.frame column wherever they match an entry in a vector of strings.
names_to_remove <- c("Peter", "Thomas Loco", "Sarah Miller", "Diana", "Burak El", "Stacy")
data$text
| text |
|Sarah Miller apple tree |
|Peter peach cake |
|Thomas Loco banana bread |
|Diana apple cookies |
|Burak El melon juice |
|Stacy maple tree |
The actual data.frame has ~50k rows, and the list has ~15k entries.
I tried to replace the names with data$text <- str_replace(data$text, regex(str_c("\\b", names_to_remove, "\\b", collapse = "|")), "name"),
but this leaves me with a column that contains only NA values. Do you have any idea how to solve this?
CodePudding user response:
If df is your dataframe:
df <- structure(list(text = c("Sarah Miller apple tree", "Peter peach cake", "Thomas Loco banana bread", "Diana apple cookies", "Burak El melon juice ", "Stacy maple tree ")), class = "data.frame", row.names = c(NA, -6L))
text
1 Sarah Miller apple tree
2 Peter peach cake
3 Thomas Loco banana bread
4 Diana apple cookies
5 Burak El melon juice
6 Stacy maple tree
We could do:
library(dplyr)
library(stringr)
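# Join the names into one alternation; \b stops "Peter" from matching inside a longer word
pattern <- paste0("\\b(?:", paste(names_to_remove, collapse = "|"), ")\\b")
df %>%
  # str_remove() drops the first match in each row; str_squish() trims the leftover spaces
  mutate(text = str_squish(str_remove(text, pattern)))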
text
1 apple tree
2 peach cake
3 banana bread
4 apple cookies
5 melon juice
6 maple tree
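As for the NA column in the question: one possible cause, which the post does not confirm, is a missing value somewhere in the ~15k names. str_c() propagates NA, so a single NA entry makes the whole collapsed pattern NA, while paste() turns it into the literal text "NA". Below is a minimal base R sketch that guards against this. The NA entry added to names_to_remove and the helper name clean_names are assumptions made for illustration. It also quotes each name with \Q...\E so that regex metacharacters in a name are taken literally, and tries longer names first so that a short name cannot match part of a longer one.

```r
# Sample data from the question; the trailing NA is a hypothetical bad entry
names_to_remove <- c("Peter", "Thomas Loco", "Sarah Miller", "Diana",
                     "Burak El", "Stacy", NA)
df <- data.frame(text = c("Sarah Miller apple tree", "Peter peach cake",
                          "Thomas Loco banana bread", "Diana apple cookies",
                          "Burak El melon juice ", "Stacy maple tree "))

# Drop missing and empty entries before building the pattern
clean_names <- names_to_remove[!is.na(names_to_remove) & nzchar(names_to_remove)]

# Put longer names first so they are tried before any shorter name they contain
clean_names <- clean_names[order(nchar(clean_names), decreasing = TRUE)]

# \Q...\E (PCRE, hence perl = TRUE below) treats each name as literal text
pattern <- paste0("\\b(?:",
                  paste0("\\Q", clean_names, "\\E", collapse = "|"),
                  ")\\b")

# Remove the first matching name per row, then tidy up the whitespace
df$text <- sub(pattern, "", df$text, perl = TRUE)
df$text <- trimws(gsub("\\s+", " ", df$text))

print(df)
```

To replace rather than remove, pass "name" instead of "" as the replacement in sub(). With ~15k alternatives the pattern gets very long, so it is worth timing this on the real data first.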