R: How can I use strings of multiple columns as patterns in gsub?

I need to create a column named incorrect that contains all of the words from written that appear in neither target.apple nor target.banana.

library(dplyr)
library(stringr)

recall <- data.frame(written = c("apples car banana hat pencil r", "papeer apple cars spoon", "dice banaana pen f apple berry"))
recall <- recall %>% mutate(target.apple = str_extract(written,"app([^ ]+)"),
                            target.banana = str_extract(written,"bana([^ ]+)"))

Example:

                           written  target.apple   target.banana           incorrect
1   apples car banana hat pencil r        apples          banana    car hat pencil r
2          papeer apple cars spoon         apple            <NA>   papeer cars spoon
3   dice banaana pen f apple berry         apple         banaana    dice pen f berry

Thank you.

CodePudding user response：

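We can use dplyr with rowwise(). First tokenize each sentence (split it into words), then drop the words that appear in any of the target columns. Note that tokenize_words() lowercases its input and strips punctuation by default, and toString() joins the remaining words with ", "; use paste(..., collapse = " ") instead if you want them space-separated.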

library(dplyr)
library(tokenizers)

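recall %>%
    rowwise() %>%
    # split each sentence into a list-column of words
    mutate(incorrect = tokenize_words(written),
           # keep only the words that match none of the target.* columns
           incorrect = toString(incorrect[!incorrect %in% c_across(contains('target'))])) %>%
    ungroup()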

# A tibble: 3 × 4
  written                        target.apple target.banana incorrect          
  <chr>                          <chr>        <chr>         <chr>              
1 apples car banana hat pencil r apples       banana        car, hat, pencil, r
2 papeer apple cars spoon        apple        NA            papeer, cars, spoon
3 dice banaana pen f apple berry apple        banaana       dice, pen, f, berry
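
If you would rather not depend on tokenizers, the same split-and-filter idea can be sketched in base R. This is only an illustration: remove_targets() is a hypothetical helper name, the target columns are typed in by hand here instead of being extracted, and words are assumed to be separated by spaces.

```r
# Base-R sketch: split each sentence into words, drop the target words,
# and paste the remainder back together with single spaces.
recall <- data.frame(
  written = c("apples car banana hat pencil r",
              "papeer apple cars spoon",
              "dice banaana pen f apple berry"),
  target.apple  = c("apples", "apple", "apple"),
  target.banana = c("banana", NA, "banaana"),
  stringsAsFactors = FALSE
)

# remove_targets() is a hypothetical helper, not part of any package
remove_targets <- function(sentence, ...) {
  words   <- strsplit(sentence, " +")[[1]]  # tokenize on runs of spaces
  targets <- c(...)                         # may contain NA, which matches no word
  paste(words[!words %in% targets], collapse = " ")
}

recall$incorrect <- mapply(remove_targets,
                           recall$written,
                           recall$target.apple,
                           recall$target.banana,
                           USE.NAMES = FALSE)
recall$incorrect
```

Because whole words are compared with %in%, an NA target is ignored and a target never removes part of a longer word.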

CodePudding user response：

The NA values make this a bit tricky, as str_remove_all() doesn't handle NA patterns (or pattern = ""). The neatest way I can think of dealing with this is to create a function str_remove_any() that does handle NA patterns (by ignoring them). Then you can do something like this:

library(stringr)
library(dplyr, warn.conflicts = FALSE)

recall <- tibble(
  written = c(
    "apples car banana hat pencil r", 
    "papeer apple cars spoon", 
    "dice banaana pen f apple berry"
  )
)

str_remove_any <- function(x, pattern) {
  not_na <- !is.na(pattern)
  x[not_na] <- str_remove_all(x[not_na], pattern[not_na])
  x
}

recall %>% 
  mutate(
    target.apple = str_extract(written,"app([^ ]+)"),
    target.banana = str_extract(written,"bana([^ ]+)"),
    incorrect = written %>%
      str_remove_any(fixed(target.apple)) %>%
      str_remove_any(fixed(target.banana))
  )
#> # A tibble: 3 × 4
#>   written                        target.apple target.banana incorrect           
#>   <chr>                          <chr>        <chr>         <chr>               
#> 1 apples car banana hat pencil r apples       banana        " car  hat pencil r"
#> 2 papeer apple cars spoon        apple        NA            "papeer  cars spoon"
#> 3 dice banaana pen f apple berry apple        banaana       "dice  pen f  berry"
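
The result above still contains the spaces that surrounded the removed words. One possible cleanup, sketched here in base R on the three strings copied from that output, is to collapse runs of spaces and trim the ends. stringr::str_squish() should do the same in one call if you are already using stringr.

```r
# Strings copied from the incorrect column above
incorrect <- c(" car  hat pencil r", "papeer  cars spoon", "dice  pen f  berry")

# Collapse runs of spaces into one, then trim leading/trailing whitespace
cleaned <- trimws(gsub(" +", " ", incorrect))
cleaned
```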

CodePudding user response：

You could simply remove all matches of appl[^ ]+ and bana[^ ]+ by substituting them with the empty string:

recall$incorrect <- gsub("appl[^ ]+|bana[^ ]+", "", recall$written)
recall$incorrect
[1] " car  hat pencil r" "papeer  cars spoon" "dice  pen f  berry"

If you have many target regexes, or they are generated procedurally, you can paste them together into a single matching pattern, using | as the collapse separator:

targets <- c("appl[^ ]+", "bana[^ ]+")
recall$incorrect <- gsub(paste(targets, collapse = "|"), "", recall$written)
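
If the patterns should come from the target columns themselves rather than from fixed regexes, keep in mind that gsub() is not vectorised over its pattern argument, so the pattern has to be built row by row. A rough sketch, assuming the target values are plain words without regex metacharacters and that the target columns are already filled in (typed by hand here):

```r
recall <- data.frame(
  written = c("apples car banana hat pencil r",
              "papeer apple cars spoon",
              "dice banaana pen f apple berry"),
  target.apple  = c("apples", "apple", "apple"),
  target.banana = c("banana", NA, "banaana"),
  stringsAsFactors = FALSE
)

target_cols <- c("target.apple", "target.banana")

recall$incorrect <- vapply(seq_len(nrow(recall)), function(i) {
  # Collect this row's targets and drop the missing ones
  targets <- unlist(recall[i, target_cols], use.names = FALSE)
  targets <- targets[!is.na(targets)]
  if (length(targets) == 0) return(recall$written[i])
  # Join the targets with | and anchor them at word boundaries
  pattern  <- paste0("\\b(", paste(targets, collapse = "|"), ")\\b")
  stripped <- gsub(pattern, "", recall$written[i], perl = TRUE)
  # Tidy up the spaces left behind by the removed words
  trimws(gsub(" +", " ", stripped))
}, character(1))
recall$incorrect
```

The word boundaries keep a target such as apple from also deleting part of a longer word; drop them if partial matches are what you want.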