Match a series of single or multiple words in a given vector in R

Time:07-22

I want to check whether a series of words is in a given vector. This is usually not a problem in R using %in% when you just want to match one word per observation. But what happens when an observation may contain more than one valid word?

To make it clearer:

Say we have this list of words:

words <- c("hi","hello","bye","chao")

And we have as observations:

var <- c("hi", "hi; hello", "bye", "yes", "hi; hello; by")

Using:

var %in% words
is.element(var,words)

we get:

T,F,T,F,F

But what if I want entries such as "hi; hello" to be valid as well? I was thinking I could use a function that looks for a pattern like:

words_grepl <- c("hi|hello|bye|chao")
var <- c("hi", "hi; hello", "bye", "yes", "hi; hello; by")
grepl(words_grepl,var)

Then we get:

T,T,T,F,T

This is close to what I am looking for, but the last element of the vector is a problem: in "hi; hello; by", "hi" and "hello" are valid but "by" is not. I want a method that returns T only when all the words are valid.

Is there a way to solve this?

PS: Ignoring the ";" would not be a problem; I can simply use:

var <- gsub(";","",c("hi", "hi; hello", "bye", "yes", "hi; hello; by"))

CodePudding user response:

You could split each element of the vector into its individual words (strsplit returns a list with one character vector per element), then apply your first expression, wrapped in all, to each of the five elements:

varlist <- c("hi", "hi; hello", "bye", "yes", "hi; hello; by") |> strsplit("; ")
sapply(varlist, \(x) all(x %in% words))

Output:

[1]  TRUE  TRUE  TRUE FALSE FALSE
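
As a rough sketch, the same split-then-check idea can be wrapped in a small helper that also tolerates inconsistent spacing around the separator. The function name all_words_valid, its sep argument, the obs vector, and the extra "hi;hello" observation are made up for illustration and are not part of the answer above. It uses function(...) rather than the \(x) shorthand so that it also runs on R versions before 4.1.

```r
words <- c("hi", "hello", "bye", "chao")
obs <- c("hi", "hi; hello", "bye", "yes", "hi; hello; by", "hi;hello")

# Hypothetical helper: TRUE only if every token of an observation is in `valid`
all_words_valid <- function(x, valid, sep = ";") {
  # Split each observation on the literal separator (no regex interpretation)
  tokens <- strsplit(x, sep, fixed = TRUE)
  # Trim surrounding whitespace from each token, then require all to be valid
  vapply(tokens, function(tok) all(trimws(tok) %in% valid), logical(1))
}

res <- all_words_valid(obs, words)
print(res)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE
```

Splitting on ";" alone and trimming afterwards means "hi;hello" and "hi; hello" are treated the same, which the fixed "; " separator would not do.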

CodePudding user response:

Here is a base R solution.

words <- c("hi","hello","bye","chao")
var <- c("hi", "hi; hello", "bye", "yes", "hi; hello; by")
len <- lengths(strsplit(var, ";"))
rowSums(sapply(words, \(x) grepl(x, var))) == len
#> [1]  TRUE  TRUE  TRUE FALSE FALSE

Created on 2022-07-22 by the reprex package (v2.0.1)
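
One caveat with the counting approach: grepl matches substrings and each word is counted at most once per observation, so the comparison with len can go wrong on inputs outside the original example. The sketch below illustrates this with two extra, made-up observations ("this" and "hi; hi") and shows one possible alternative, an anchored regular expression built from words. It assumes the words contain no regex metacharacters and that tokens are separated by exactly "; ". The names obs, counted, and anchored are illustrative only.

```r
words <- c("hi", "hello", "bye", "chao")
obs <- c("hi", "hi; hello", "bye", "yes", "hi; hello; by", "this", "hi; hi")

# Counting approach from the answer above, applied to the extended input
len <- lengths(strsplit(obs, ";"))
counted <- rowSums(sapply(words, function(w) grepl(w, obs))) == len
print(counted)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE
# "this" passes because it contains "hi"; "hi; hi" fails because "hi" counts once

# Alternative sketch: one anchored pattern that must match the whole string
alt <- paste(words, collapse = "|")
pattern <- paste0("^(", alt, ")(; (", alt, "))*$")
anchored <- grepl(pattern, obs)
print(anchored)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
```

Both versions agree on the five observations from the question; they differ only on the two added edge cases.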
