Replace multiple phrases with NA (or blank) in R

I am working in R.

I have some phrases that I want to remove from some text strings in a dataframe. words_remove shows the phrases I want to replace. Unless the whole exact phrase is in the string, I don't want it to be removed.

words_remove <- c("red cats", "blue dogs", "pink horse")

This is my data frame:

data <- data.frame(row_id=1:4, text = c("red cats don't exist", "I have a blue dog", "I don't like blue dogs", "I like horses"))

row_id	text
1	red cats don't exist
2	I have a blue dog
3	I don't like blue dogs
4	I like horses

I want to replace every occurrence of the phrases in words_remove within the text column with NA (or, even better, remove them entirely).

My required output:

row_id	text
1	don't exist
2	I have a blue dog
3	I don't like
4	I like horses

In my real dataframe, words_remove contains many phrases, so handling each one with case_when or similar would be too time-consuming, I think.

Any ideas?

CodePudding user response：

You may form a regex alternation of the phrases and do a replacement on that:

words_remove <- c("red cats", "blue dogs", "pink horse")
regex <- paste0("\\s*\\b(?:", paste(words_remove, collapse="|"), ")\\b\\s*")
data$text <- gsub("^\\s+|\\s+$", "", gsub(regex, " ", data$text))
data

row_id              text
1      1       don't exist
2      2 I have a blue dog
3      3      I don't like
4      4     I like horses

The strategy here is to replace any matching phrase plus any surrounding whitespace with just a single space. The outer call to gsub() strips off any remaining leading/trailing whitespace.
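
For reuse, the same strategy could be wrapped in a small helper. This is only a sketch: the name remove_phrases is made up for illustration, and the \Q...\E quoting (which needs perl = TRUE) is an added assumption that the phrases are plain text rather than regex patterns. The word boundaries also assume each phrase starts and ends with a letter or digit.

```r
# Hypothetical helper (not from the answer above): strip whole phrases from a character vector.
remove_phrases <- function(x, phrases) {
  # \Q...\E makes PCRE treat each phrase literally (assumes no phrase contains "\E").
  quoted <- paste0("\\Q", phrases, "\\E")
  pattern <- paste0("\\s*\\b(?:", paste(quoted, collapse = "|"), ")\\b\\s*")
  # Replace each match plus its surrounding whitespace with one space, then trim both ends.
  trimws(gsub(pattern, " ", x, perl = TRUE))
}

data <- data.frame(
  row_id = 1:4,
  text = c("red cats don't exist", "I have a blue dog",
           "I don't like blue dogs", "I like horses")
)
words_remove <- c("red cats", "blue dogs", "pink horse")

data$text <- remove_phrases(data$text, words_remove)
data$text
# "don't exist" "I have a blue dog" "I don't like" "I like horses"
```

trimws() does the job of the outer gsub() in the answer, so only one regex needs to be maintained.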

CodePudding user response：

One approach is to join the phrases with the regex OR operator (|) and pass that pattern to either stringr or base gsub(). The two lines below are alternatives, so use one or the other:

data$text <- stringr::str_remove_all(data$text, paste0(words_remove, collapse = '|'))

data$text <- gsub(paste0(words_remove, collapse = '|'), "", data$text)
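
A plain alternation like the one above also matches inside longer words and leaves stray spaces behind. The sketch below compares it with a word-bounded variant in base R. The sample vector x and the final NA step are illustrative assumptions (treating "replace with NA" as "set strings that end up empty to NA"), not something stated in the question.

```r
words_remove <- c("red cats", "blue dogs", "pink horse")
# Illustrative input, not the asker's data.
x <- c("I don't like blue dogs", "two pink horses", "red cats don't exist", "blue dogs")

# Plain alternation: leaves stray spaces and also matches inside "pink horses".
plain <- gsub(paste0(words_remove, collapse = "|"), "", x)
plain
# "I don't like " "two s" " don't exist" ""

# Word-bounded alternation, then collapse repeated whitespace and trim both ends.
bounded <- paste0("\\b(?:", paste(words_remove, collapse = "|"), ")\\b")
clean <- trimws(gsub("\\s+", " ", gsub(bounded, "", x, perl = TRUE)))

# Assumption: a string that is empty after removal should become NA.
clean[!nzchar(clean)] <- NA
clean
# "I don't like" "two pink horses" "don't exist" NA
```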