Replace words from list of words

I have this data frame:

df <- structure(list(ID = 1:3, Text = c("there was not clostridium", "clostridium difficile positive", "test was OK but there was clostridium")), class = "data.frame", row.names = c(NA, -3L)) 
 ID                                  Text
1  1             there was not clostridium
2  2        clostridium difficile positive
3  3 test was OK but there was clostridium

And a pattern of stop words:

stop <- paste0(c("was", "but", "there"), collapse = "|")

I would like to go through the Text for each ID and remove the words that match the stop pattern. It is important to keep the order of the words. I do not want to use merge functions.

I have tried this:

  df$Words <- tokenizers::tokenize_words(df$Text, lowercase = TRUE) ##I would like to make a list of single words

for (i in length(df$Words)){
  
  df$clean <- lapply(df$Words, function(y) lapply(1:length(df$Words[i]),
                                                 function(x) stringr::str_replace(unlist(y) == x, stop, "REPLACED")))
  
  
}

But this gives me a list of logical vectors, not a list of words.

> df
  ID                                  Text                                       Words                                           clean
1  1             there was not clostridium                there, was, not, clostridium                      FALSE, FALSE, FALSE, FALSE
2  2        clostridium difficile positive            clostridium, difficile, positive                             FALSE, FALSE, FALSE
3  3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE

I would like to get this (replace all words from stop pattern and keep word order)

> df
  ID                                  Text                                       Words                                           clean
1  1             there was not clostridium                there, was, not, clostridium                      "REPLACED", "REPLACED", not, clostridium
2  2        clostridium difficile positive            clostridium, difficile, positive                             clostridium, difficile, positive
3  3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium test, "REPLACED", OK, "REPLACED", "REPLACED", "REPLACED", clostridium

CodePudding user response：

You can use data.table for it

library(data.table)

df = as.data.table(df)[, clean := lapply(Words, function(x) gsub(stop, "REPLACED", x))]

Or you can use base R (without creating the Words column):

df$clean = lapply(strsplit(df$Text, " "), function(x) gsub(stop, "REPLACED", x))
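One caveat: the pattern in stop is matched as a substring, so a token such as "wash" would come back as "REPLACEDh". Below is a minimal sketch, assuming you only want whole tokens replaced. It anchors the alternation at both ends of each token; stop_words and stop_exact are illustrative names, not part of the question.

```r
# Sample data from the question
df <- data.frame(
  ID = 1:3,
  Text = c("there was not clostridium",
           "clostridium difficile positive",
           "test was OK but there was clostridium"),
  stringsAsFactors = FALSE
)

stop_words <- c("was", "but", "there")

# Anchor with ^ and $ so only complete tokens match (assumption: exact matches wanted)
stop_exact <- paste0("^(", paste(stop_words, collapse = "|"), ")$")

# Split on single spaces, then replace matching tokens; word order is preserved
words <- strsplit(df$Text, " ", fixed = TRUE)
df$clean <- lapply(words, function(x) sub(stop_exact, "REPLACED", x))

df$clean[[1]]
```

With this pattern a token like "wash" is left untouched, because the whole token has to equal one of the stop words.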

CodePudding user response：

Are you trying to REMOVE the "stop words"?

Tidyverse one-liner:

library(stringr)
library(dplyr)

df %>% mutate(Words = str_remove_all(Text, stop))

  ID                                  Text                          Words
1  1             there was not clostridium                not clostridium
2  2        clostridium difficile positive clostridium difficile positive
3  3 test was OK but there was clostridium        test  OK    clostridium
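Removing the matches leaves the surrounding spaces behind (see the doubled spaces in row 3), and the bare pattern would also strip "was" out of a longer word such as "wash". Here is a rough sketch, assuming whole-word removal with tidied whitespace is what you want. It wraps the alternation in word boundaries and squeezes the leftover spaces; stop_words and stop_bounded are illustrative names.

```r
library(stringr)
library(dplyr)

# Sample data from the question
df <- data.frame(
  ID = 1:3,
  Text = c("there was not clostridium",
           "clostridium difficile positive",
           "test was OK but there was clostridium"),
  stringsAsFactors = FALSE
)

stop_words <- c("was", "but", "there")

# \\b restricts matches to whole words (assumption: partial matches are unwanted)
stop_bounded <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")

# Remove the stop words, then collapse repeated spaces and trim both ends
res <- df %>%
  mutate(Words = str_squish(str_remove_all(Text, stop_bounded)))

res
```

str_squish() collapses runs of whitespace and trims both ends, so row 3 should read "test OK clostridium" rather than keeping the gaps.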