I want to check whether a series of words is in a given vector. In R this is usually not a problem with %in% when you just want to match one word per observation. But what happens when an observation may contain more than one valid word?
To make it clearer, say we have this list of words:
words <- c("hi","hello","bye","chao")
And we have as observations:
var <- c("hi", "hi; hello", "bye", "yes", "hi; hello; by")
Using:
var %in% words
is.element(var,words)
we get:
T,F,T,F,F
But what if I want observations such as "hi; hello" to be valid as well? I was thinking I could use a function that looks for a pattern, like:
words_grepl <- c("hi|hello|bye|chao")
var <- c("hi", "hi; hello", "bye", "yes", "hi; hello; by")
grepl(words_grepl,var)
Then we get:
T,T,T,F,T
This is close to what I am looking for, but the problem arises in the last element of the vector, "hi; hello; by", where "hi" and "hello" are valid but "by" is not. I want a method that returns T only when all the words are valid.
Is there a way to solve this?
PS: ignoring the ";" would not be a problem; I can simply use:
var <- gsub(";","",c("hi", "hi; hello", "bye", "yes", "hi; hello; by"))
CodePudding user response:
You could turn each element of the vector into a list of individual words (using strsplit), and apply your first expression wrapped in all to all five elements:
varlist <- c("hi", "hi; hello", "bye", "yes", "hi; hello; by") |> strsplit("; ")
sapply(varlist, \(x) all(x %in% words))
Output:
[1] TRUE TRUE TRUE FALSE FALSE
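If the separator is not always exactly "; " (extra spaces, or no space at all), a slightly more defensive variant could split on the semicolon together with any surrounding whitespace. This is only a minimal sketch: all_in is a hypothetical helper name, not a base R function, and it assumes the words themselves never contain a semicolon.

```r
words <- c("hi", "hello", "bye", "chao")
var <- c("hi", "hi; hello", "bye", "yes", "hi; hello; by")

# all_in() is a hypothetical helper name, not part of base R
all_in <- function(x, valid) {
  # split on ";" with optional whitespace on either side
  parts <- strsplit(trimws(x), "\\s*;\\s*")
  # TRUE only when every split word is an exact match in `valid`
  vapply(parts, function(p) all(p %in% valid), logical(1))
}

all_in(var, words)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
```

vapply is used instead of sapply so the result is always a logical vector, even when the input has length zero.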
CodePudding user response:
Here is a base R solution.
words <- c("hi","hello","bye","chao")
var <- c("hi", "hi; hello", "bye", "yes", "hi; hello; by")
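# number of ";"-separated words in each observation
len <- lengths(strsplit(var, ";"))
# count how many of the valid words occur in each observation and compare
rowSums(sapply(words, \(x) grepl(x, var))) == len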
#> [1] TRUE TRUE TRUE FALSE FALSE
Created on 2022-07-22 by the reprex package (v2.0.1)
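One caveat with the grepl-based count: grepl matches substrings, so the result depends on no valid word appearing inside an invalid one, and on no word being repeated within an observation. A small illustration, using made-up observations ("high" and "hi; hi") that are not in the question, compared with an exact match on the split words:

```r
words <- c("hi", "hello", "bye", "chao")
# hypothetical observations, chosen only to show the edge cases
var2 <- c("high", "hi; hi")

# substring-based count, as in the answer above
len <- lengths(strsplit(var2, ";"))
by_grepl <- rowSums(sapply(words, \(x) grepl(x, var2))) == len

# exact comparison of each split word
by_exact <- sapply(strsplit(var2, ";\\s*"), \(x) all(x %in% words))

by_grepl
#> [1]  TRUE FALSE
by_exact
#> [1] FALSE  TRUE
```

Here "high" passes the grepl check because it contains "hi", and "hi; hi" fails it because the repeated word is only counted once. Comparing the split words directly avoids both cases.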