I would like to remove words from a character vector. This is how I do it:
library(tm)
words = c("the", "The", "Intelligent", "this", "This")
words_to_remove = c("the", "This")
removeWords(tolower(words), tolower(words_to_remove))
This is really nice, but I would like the word "Intelligent" to be returned as it was, meaning "Intelligent" instead of "intelligent".
Is it possible to apply tolower only inside the call to removeWords, so that it is used for matching but does not change the returned words?
CodePudding user response:
You could just use a base R approach with grepl here:
words_to_remove = c("the", "This")
pattern <- paste0("\\b", words_to_remove, "\\b", collapse="|")
words = c("the", "The", "Intelligent", "this", "This")
res <- grepl(pattern, words, ignore.case=TRUE)
words[!res]
[1] "Intelligent"
The trick here is the call to paste0 with collapse="|", which generates the following pattern:
\bthe\b|\bThis\b
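With ignore.case=TRUE, this single alternation lets grepl flag every element of words that contains one of the words to remove, whatever its case. The original strings are never lowercased, so "Intelligent" is returned unchanged.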
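If the goal is to blank out the matched words the way removeWords does, rather than drop whole elements, the same pattern can be passed to gsub. This is only a sketch under that assumption; the names blanked and sentence_result are illustrative and not part of the original answer.

```r
words <- c("the", "The", "Intelligent", "this", "This")
words_to_remove <- c("the", "This")

# Same word-boundary alternation as above: \bthe\b|\bThis\b
pattern <- paste0("\\b", words_to_remove, "\\b", collapse = "|")

# Replace case-insensitive matches with "" and leave everything else untouched
blanked <- gsub(pattern, "", words, ignore.case = TRUE)
blanked
# [1] ""            ""            "Intelligent" ""            ""

# The same call also works on multi-word strings; surrounding spaces are kept
sentence_result <- gsub(pattern, "", "The Intelligent fox", ignore.case = TRUE)
sentence_result
# [1] " Intelligent fox"
```

Note that grepl with this pattern would drop the whole string "The Intelligent fox", because it only reports whether a match occurs anywhere in the element.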
CodePudding user response:
Here is another option using base R's %in% operator:
words = c("the", "The", "Intelligent", "this", "This")
words_to_remove = c("the", "This")
words[!(tolower(words) %in% tolower(words_to_remove))]
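Both vectors are lowercased only for the comparison, so %in% returns TRUE for every element of words that matches an entry of words_to_remove regardless of case. Negating that result with ! keeps the remaining words in their original case.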
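For repeated use, the comparison can be wrapped in a small function. This is a minimal sketch; remove_words_ci is a hypothetical name, not a function from tm or base R, and it matches whole elements only, not words inside longer strings.

```r
# Hypothetical helper: drop elements of x that equal any entry of `remove`,
# comparing case-insensitively but returning the survivors unchanged
remove_words_ci <- function(x, remove) {
  x[!(tolower(x) %in% tolower(remove))]
}

kept <- remove_words_ci(c("the", "The", "Intelligent", "this", "This"),
                        c("the", "This"))
kept
# [1] "Intelligent"
```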