I am working in R.
I have some phrases that I want to remove from text strings in a data frame. words_remove lists the phrases I want to remove. A phrase should be removed only when the whole, exact phrase appears in the string.
words_remove <- c("red cats", "blue dogs", "pink horse")
This is my data frame:
data <- data.frame(row_id=1:4, text = c("red cats don't exist", "I have a blue dog", "I don't like blue dogs", "I like horses"))
row_id | text |
---|---|
1 | red cats don't exist |
2 | I have a blue dog |
3 | I don't like blue dogs |
4 | I like horses |
I want to replace all instances of "words_remove" in "text" with NA (or even better remove them entirely).
My required output:
row_id | text |
---|---|
1 | don't exist |
2 | I have a blue dog |
3 | I don't like |
4 | I like horses |
In my real data frame, words_remove contains many phrases, so handling each one with case_when() or similar would be too time-consuming, I think.
Any ideas?
CodePudding user response:
You may form a regex alternation of the phrases and do a replacement on that:
words_remove <- c("red cats", "blue dogs", "pink horse")
regex <- paste0("\\s*\\b(?:", paste(words_remove, collapse="|"), ")\\b\\s*")
data$text <- gsub("^\\s+|\\s+$", "", gsub(regex, " ", data$text))
data
  row_id              text
1      1       don't exist
2      2 I have a blue dog
3      3      I don't like
4      4     I like horses
The strategy here is to replace any matching phrase plus any surrounding whitespace with just a single space. The outer call to gsub() strips off any remaining leading/trailing whitespace.
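To make the two steps easier to follow, here is a minimal sketch that runs them separately on a small vector. The vector `text` and the names `step1` and `step2` are only for illustration, `perl = TRUE` is my addition so that `(?:...)` and `\\b` are handled by PCRE, and `trimws()` stands in for the outer gsub() call.

```r
words_remove <- c("red cats", "blue dogs", "pink horse")

# Illustrative input; the last element has a phrase in the middle of the string
text <- c("red cats don't exist", "I have a blue dog",
          "I don't like blue dogs", "my blue dogs bark")

# Alternation of all phrases, wrapped in word boundaries and optional whitespace
regex <- paste0("\\s*\\b(?:", paste(words_remove, collapse = "|"), ")\\b\\s*")

# Step 1: a matched phrase plus its surrounding whitespace becomes one space
step1 <- gsub(regex, " ", text, perl = TRUE)

# Step 2: drop the leading/trailing space that step 1 can leave behind
step2 <- trimws(step1)

step2
# [1] "don't exist"       "I have a blue dog" "I don't like"      "my bark"
```

Because the phrase in the middle of a string is replaced by a single space, the words on either side stay separated, and "blue dog" is left alone since only the full phrase "blue dogs" is in the alternation.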
CodePudding user response:
One way to approach this is to use the stringr package or base gsub() and feed it a pattern built with the or operator (|):
# Option 1: stringr
data$text <- stringr::str_remove_all(data$text, paste0(words_remove, collapse = '|'))
# Option 2: base R equivalent
data$text <- gsub(paste0(words_remove, collapse = '|'), "", data$text)
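Two caveats with a plain alternation: each phrase is interpreted as a regular expression, so characters such as ( or . in a phrase would need escaping, and the whitespace around a removed phrase is left behind. A minimal sketch of one way around both, assuming literal matching is what you want; `remove_phrases` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: remove each phrase literally, then tidy up whitespace
remove_phrases <- function(x, phrases) {
  for (p in phrases) {
    # fixed = TRUE treats the phrase as a literal string, not a regex
    x <- gsub(p, "", x, fixed = TRUE)
  }
  # Collapse runs of whitespace to one space and trim both ends
  trimws(gsub("\\s+", " ", x))
}

words_remove <- c("red cats", "blue dogs", "pink horse")
data <- data.frame(row_id = 1:4,
                   text = c("red cats don't exist", "I have a blue dog",
                            "I don't like blue dogs", "I like horses"))

data$text <- remove_phrases(data$text, words_remove)
data$text
# [1] "don't exist"       "I have a blue dog" "I don't like"      "I like horses"
```

Note that literal matching has no word boundaries, so "pink horse" would also be removed from inside "pink horses". If that matters, the regex with \\b from the other answer is the safer choice.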