R remove strings from a column matched in a list

Time:09-17

I'm trying to remove specific strings from a data.frame column whenever they match an entry in a character vector of names.

names_to_remove <- c("Peter", "Thomas Loco", "Sarah Miller", "Diana", "Burak El", "Stacy")

data$text
| text | 
|Sarah Miller apple tree |
|Peter peach cake |
|Thomas Loco banana bread |
|Diana apple cookies |
|Burak El melon juice |
|Stacy maple tree |

The actual data.frame has ~50k rows, and the list has ~15k entries.

So far I have tried replacing the names with

data$text <- str_replace(data$text, regex(str_c("\\b", names_to_remove, "\\b", collapse = "|")), "name")

but this leaves me with a column containing only NA values. Do you have an idea how to solve this?

CodePudding user response:

If df is your dataframe:

df <- structure(list(text = c("Sarah Miller apple tree", "Peter peach cake", "Thomas Loco banana bread", "Diana apple cookies", "Burak El melon juice ", "Stacy maple tree ")), class = "data.frame", row.names = c(NA, -6L))

                      text
1  Sarah Miller apple tree
2         Peter peach cake
3 Thomas Loco banana bread
4      Diana apple cookies
5    Burak El melon juice 
6        Stacy maple tree 

We could do:

library(dplyr)
library(stringr)

# Join all names into one regex alternation: "Peter|Thomas Loco|Sarah Miller|..."
pattern <- paste(names_to_remove, collapse = "|")

df %>% 
  mutate(text = str_remove(text, pattern))
            text
1     apple tree
2     peach cake
3   banana bread
4  apple cookies
5   melon juice 
6    maple tree 
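
A slightly more defensive variant of the same idea, offered as a sketch rather than a definitive fix. The pattern above treats every name as a raw regular expression, can match inside longer words (Peter in Peterson), and str_remove() only removes the first hit per row, leaving the surrounding whitespace in place. The version below drops NA entries, escapes regex metacharacters, tries longer names first, adds word boundaries, and squeezes the leftover whitespace. The extra NA entry and the "Peterson pear pie" row are made up for illustration and are not part of the original data. An NA in names_to_remove is one possible cause of the all-NA column in the question, since str_c() returns NA when any piece is NA, but that is an assumption about the real data.

```r
library(stringr)

# Names from the question, plus a hypothetical NA entry for illustration
names_to_remove <- c("Peter", "Thomas Loco", "Sarah Miller", "Diana",
                     "Burak El", "Stacy", NA)

# Data from the answer, plus a made-up row that must stay untouched
df <- data.frame(text = c("Sarah Miller apple tree", "Peter peach cake",
                          "Thomas Loco banana bread", "Diana apple cookies",
                          "Burak El melon juice ", "Stacy maple tree ",
                          "Peterson pear pie"),
                 stringsAsFactors = FALSE)

# Drop NA and empty entries; str_c() returns NA if any piece is NA
keep <- !is.na(names_to_remove) & names_to_remove != ""
clean_names <- names_to_remove[keep]

# Longest names first, so "Thomas Loco" would be tried before a
# (hypothetical) shorter entry such as "Thomas"
clean_names <- clean_names[order(nchar(clean_names), decreasing = TRUE)]

# Escape regex metacharacters so a name containing "." or "(" matches literally
escaped <- str_replace_all(clean_names, "([.^$\\\\|*+?{}\\[\\]()])", "\\\\\\1")

# One alternation wrapped in word boundaries
pattern <- str_c("\\b(?:", str_c(escaped, collapse = "|"), ")\\b")

# Remove every occurrence, then collapse and trim leftover whitespace
df$text <- str_squish(str_remove_all(df$text, pattern))

print(df)
```

With ~15k names the single alternation can get large and slow. This sketch has not been timed at that size, so building the pattern from chunks of names and applying them in turn may be worth trying if it turns out to be a bottleneck.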