I need to make a column named incorrect that contains all of the words from written that do not appear in target.apple and target.banana.
library(dplyr)
library(stringr)
recall <- data.frame(written = c("apples car banana hat pencil r", "papeer apple cars spoon", "dice banaana pen f apple berry"))
recall <- recall %>% mutate(target.apple = str_extract(written, "app([^ ]+)"),
                            target.banana = str_extract(written, "bana([^ ]+)"))
Example:
                         written target.apple target.banana         incorrect
1 apples car banana hat pencil r       apples        banana  car hat pencil r
2        papeer apple cars spoon        apple          <NA> papeer cars spoon
3 dice banaana pen f apple berry        apple       banaana  dice pen f berry
Thank you.
CodePudding user response:
We can use dplyr with rowwise(). First, tokenize the sentences (split them into words), then keep only the words that do not match any of the target columns:
library(dplyr)
library(tokenizers)
recall %>%
  rowwise() %>%
  mutate(incorrect = tokenize_words(written),
         incorrect = toString(incorrect[!incorrect %in% c_across(contains('target'))])) %>%
  ungroup()
# A tibble: 3 × 4
  written                        target.apple target.banana incorrect
  <chr>                          <chr>        <chr>         <chr>
1 apples car banana hat pencil r apples       banana        car, hat, pencil, r
2 papeer apple cars spoon        apple        NA            papeer, cars, spoon
3 dice banaana pen f apple berry apple        banaana       dice, pen, f, berry
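For comparison, the same tokenize-then-filter idea can be sketched in base R, without rowwise() or the tokenizers package. This is only a sketch under a few assumptions: words are separated by single spaces, the target columns are filled in by hand so the block runs on its own, and (unlike tokenize_words() with its defaults) nothing is lowercased or stripped of punctuation. The remaining words are joined with spaces rather than commas.

```r
# Self-contained copy of the question's data; targets are typed in by hand here.
recall <- data.frame(
  written = c("apples car banana hat pencil r",
              "papeer apple cars spoon",
              "dice banaana pen f apple berry"),
  target.apple  = c("apples", "apple", "apple"),
  target.banana = c("banana", NA, "banaana"),
  stringsAsFactors = FALSE
)

# Split each sentence on single spaces (assumption: no other separators).
words <- strsplit(recall$written, " ", fixed = TRUE)

# For each row, keep the words that match neither target.
# An NA target simply never matches a word, so it needs no special handling.
recall$incorrect <- mapply(
  function(w, apple, banana) paste(w[!w %in% c(apple, banana)], collapse = " "),
  words, recall$target.apple, recall$target.banana,
  USE.NAMES = FALSE
)

recall$incorrect
```

Because this compares whole words, it leaves no stray spaces behind in the result.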
CodePudding user response:
The NA values make this a bit tricky, as str_remove_all() doesn't handle NA patterns (or pattern = ""). The neatest way I can think of dealing with this is to create a function str_remove_any() that does handle NA patterns (by ignoring them). Then you can do something like this:
library(stringr)
library(dplyr, warn.conflicts = FALSE)
recall <- tibble(
  written = c(
    "apples car banana hat pencil r",
    "papeer apple cars spoon",
    "dice banaana pen f apple berry"
  )
)
# Like str_remove_all(), but leaves x unchanged wherever pattern is NA
str_remove_any <- function(x, pattern) {
  not_na <- !is.na(pattern)
  x[not_na] <- str_remove_all(x[not_na], pattern[not_na])
  x
}
recall %>%
  mutate(
    target.apple = str_extract(written, "app([^ ]+)"),
    target.banana = str_extract(written, "bana([^ ]+)"),
    incorrect = written %>%
      str_remove_any(fixed(target.apple)) %>%
      str_remove_any(fixed(target.banana))
  )
#> # A tibble: 3 × 4
#>   written                        target.apple target.banana incorrect
#>   <chr>                          <chr>        <chr>         <chr>
#> 1 apples car banana hat pencil r apples       banana        " car  hat pencil r"
#> 2 papeer apple cars spoon        apple        NA            "papeer  cars spoon"
#> 3 dice banaana pen f apple berry apple        banaana       "dice  pen f  berry"
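Removing a word this way leaves the surrounding spaces in place, which is why the incorrect strings are printed in quotes. If that matters, one option is stringr's str_squish(), which trims both ends and collapses repeated whitespace. A minimal sketch, using hand-typed strings shaped like the column above rather than the real pipeline output:

```r
library(stringr)

# Hand-typed stand-ins for the `incorrect` column, with leftover spaces
# where the target words were removed.
incorrect <- c(" car  hat pencil r", "papeer  cars spoon", "dice  pen f  berry")

# str_squish() trims leading/trailing whitespace and collapses inner runs.
squished <- str_squish(incorrect)
squished
```

In the pipeline above, the same call could presumably be added as a final step after the second str_remove_any().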
CodePudding user response:
You could simply remove all instances of app[^ ]+ and bana[^ ]+ by substituting them with the empty string:
recall$incorrect <- gsub("appl[^ ]+|bana[^ ]+", "", recall$written)
recall$incorrect
[1] " car  hat pencil r" "papeer  cars spoon" "dice  pen f  berry"
If your target regexes are many, or procedurally generated, you can paste them together to create the matching pattern, using | as the collapse separator:
targets <- c("appl[^ ]+", "bana[^ ]+")
recall$incorrect <- gsub(paste(targets, collapse = "|"), "", recall$written)
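As a rough illustration of the "procedurally generated" case, the sketch below builds the patterns from a vector of word stems and then tidies the leftover spaces. The stems vector is a hypothetical input made up for this example, and the whitespace cleanup is an extra step that is not part of the answer above.

```r
recall <- data.frame(
  written = c("apples car banana hat pencil r",
              "papeer apple cars spoon",
              "dice banaana pen f apple berry"),
  stringsAsFactors = FALSE
)

# Hypothetical list of stems; each becomes "stem followed by non-space characters".
stems <- c("appl", "bana")
targets <- paste0(stems, "[^ ]+")

# Join the individual patterns into one alternation: "appl[^ ]+|bana[^ ]+"
pattern <- paste(targets, collapse = "|")

# Remove every match, then collapse repeated spaces and trim the ends.
cleaned <- gsub(pattern, "", recall$written)
recall$incorrect <- trimws(gsub(" +", " ", cleaned))

recall$incorrect
```

Note that these patterns match anywhere in a word, so a stem that also occurs inside an unrelated word would be removed from it too.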