Context
I have a data frame df, where id identifies a person and fruits_eat lists the fruits that person eats. I also have a vector fruits_list storing a list of fruits.
Question
I want to generate a new variable fruits_in_list that indicates whether a person ate one or more of the fruits in fruits_list, but I don't know how to implement it in R.
What I've done
I checked some existing answers, such as the ones below, but none of them are really relevant to my problem:
- R Match character vectors
- Compare two character vectors in R
- https://stackoverflow.com/search?q=How to fuzzy match two character vectors
- How to run through list of keyword vectors and fuzzy match them to a different file (R)
- Matching strings with abbreviations; fuzzy matching
Reproducible code
fruits_Jack = c('XXappleYYY,lemon,orange,pitaya')
fruits_Rose = c('Navel orange,Blood orange,watermelon,cherry')
fruits_Biden= c('pitaya,cherry,banana')
fruits_list = c('apple', 'lemon', 'orange', 'watermelon', 'peach', 'pear')
df = data.frame(id = c('Jack', 'Rose', 'Biden'),
                fruits_eat = c(fruits_Jack, fruits_Rose, fruits_Biden))
> df
id fruits_eat
1 Jack XXappleYYY,lemon,orange,pitaya
2 Rose Navel orange,Blood orange,watermelon,cherry
3 Biden pitaya,cherry,banana
Expected output
df_expect = cbind(df, fruits_in_list = c(1, 1, 0))
> df_expect
id fruits_eat fruits_in_list
1 Jack XXappleYYY,lemon,orange,pitaya 1
2 Rose Navel orange,Blood orange,watermelon,cherry 1
3 Biden pitaya,cherry,banana 0
CodePudding user response:
With stringr, use str_detect, or str_count if you want a real count:
library(stringr)
library(dplyr)
# Build one regex alternation: "apple|lemon|orange|watermelon|peach|pear"
pattern <- paste0(fruits_list, collapse = "|")
df %>%
  mutate(fruits_in_list = as.integer(str_detect(fruits_eat, pattern)),  # TRUE/FALSE -> 1/0
         count = str_count(fruits_eat, pattern))
id fruits_eat fruits_in_list count
1 Jack XXappleYYY,lemon,orange,pitaya 1 3
2 Rose Navel orange,Blood orange,watermelon,cherry 1 3
3 Biden pitaya,cherry,banana 0 0
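Note that str_count counts every match, so Rose's two kinds of orange are counted separately. If the number of distinct listed fruits is wanted instead, one possible base R sketch (not part of the answer above; eaten, hit_matrix and distinct_count are names made up for this illustration) checks each fruit separately with grepl() and sums the hits per row:

```r
# Illustrative data copied from the question
fruits_list <- c("apple", "lemon", "orange", "watermelon", "peach", "pear")
eaten <- c("XXappleYYY,lemon,orange,pitaya",
           "Navel orange,Blood orange,watermelon,cherry",
           "pitaya,cherry,banana")

# One column per fruit, one row per person; fixed = TRUE treats each
# fruit name as a literal substring rather than a regular expression
hit_matrix <- sapply(fruits_list, grepl, x = eaten, fixed = TRUE)

# Number of distinct listed fruits found in each string
distinct_count <- rowSums(hit_matrix)
print(distinct_count)
```

With the question's data this should give 3, 2 and 0, whereas str_count reports 3 for Rose because "orange" occurs twice.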
CodePudding user response:
A solution using data.table and its fast if-else function fifelse(), together with the base R function grepl() to do the matching. The "l" at the end of grepl() stands for logical: it returns TRUE if the pattern is matched anywhere in the given string (fruits_eat) and FALSE otherwise, so its result can be passed directly to the test argument of fifelse().
The key point here is that you can paste the strings "string1" and "string2" together separated by "|", and the resulting pattern "string1|string2" matches either "string1" or "string2" inside grepl().
library(data.table)
setDT(df)
df[, fruits_in_list := fifelse(grepl(paste0(fruits_list, collapse = "|"), fruits_eat),
                               1, 0)]
df
id fruits_eat fruits_in_list
1: Jack XXappleYYY,lemon,orange,pitaya 1
2: Rose Navel orange,Blood orange,watermelon,cherry 1
3: Biden pitaya,cherry,banana 0
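To see what the pasted pattern looks like and how grepl() treats it, here is a small base R sketch using the question's data (the variable names pattern, eaten and hits are only for illustration):

```r
# Fruits to look for, copied from the question
fruits_list <- c("apple", "lemon", "orange", "watermelon", "peach", "pear")

# Collapse the vector into a single regular expression with "|" (OR)
pattern <- paste0(fruits_list, collapse = "|")
print(pattern)

# The three fruits_eat strings from the question
eaten <- c("XXappleYYY,lemon,orange,pitaya",
           "Navel orange,Blood orange,watermelon,cherry",
           "pitaya,cherry,banana")

# grepl() returns one logical per string: TRUE if any alternative
# matches anywhere in it, so "XXappleYYY" still counts as "apple"
hits <- grepl(pattern, eaten)

# Convert TRUE/FALSE to the 1/0 indicator asked for in the question
print(as.integer(hits))
```

Because the match is on substrings, an entry such as "pineapple" would also count as "apple"; that is what the question's XXappleYYY example asks for, but it is worth keeping in mind for other data.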