How to fuzzy match two character vectors in R

Time:09-05

Context

I have a data frame df, where id identifies a person and fruits_eat lists the fruits that person eats. I also have a vector fruits_list storing a list of fruits.

Question

I want to generate a new variable fruits_in_list indicating whether a person ate one or more of the fruits in fruits_list, but I don't know how to implement it in R.

What I've done

I checked some answers, but none of them are very relevant to my problem, for example:

  1. R Match character vectors
  2. Compare two character vectors in R
  3. https://stackoverflow.com/search?q=How to fuzzy match two character vectors
  4. How to run through list of keyword vectors and fuzzy match them to a different file (R)
  5. Matching strings with abbreviations; fuzzy matching

Reproducible code

fruits_Jack = c('XXappleYYY,lemon,orange,pitaya')
fruits_Rose = c('Navel orange,Blood orange,watermelon,cherry')
fruits_Biden= c('pitaya,cherry,banana')

fruits_list = c('apple', 'lemon', 'orange', 'watermelon', 'peach', 'pear')

df = 
  data.frame(id         = c('Jack', 'Rose', 'Biden'),
             fruits_eat = c(fruits_Jack, fruits_Rose, fruits_Biden))

> df
     id                                  fruits_eat
1  Jack              XXappleYYY,lemon,orange,pitaya
2  Rose Navel orange,Blood orange,watermelon,cherry
3 Biden                        pitaya,cherry,banana


Expected output

df_expect = cbind(df, fruits_in_list = c(1, 1, 0))

> df_expect
     id                                  fruits_eat fruits_in_list
1  Jack              XXappleYYY,lemon,orange,pitaya              1
2  Rose Navel orange,Blood orange,watermelon,cherry              1
3 Biden                        pitaya,cherry,banana              0

CodePudding user response:

With stringr, use str_detect, or str_count if you want a real count:

library(stringr)
library(dplyr)
df %>% 
  # str_detect() returns TRUE/FALSE; as.integer() turns that into 1/0
  mutate(fruits_in_list = as.integer(str_detect(fruits_eat, paste0(fruits_list, collapse = "|"))),
         count = str_count(fruits_eat, paste0(fruits_list, collapse = "|")))
     id                                  fruits_eat fruits_in_list count
1  Jack              XXappleYYY,lemon,orange,pitaya              1     3
2  Rose Navel orange,Blood orange,watermelon,cherry              1     3
3 Biden                        pitaya,cherry,banana              0     0
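
Note that the count above is a count of pattern matches, not of distinct fruits: Rose's 3 comes from "orange" twice plus "watermelon". A minimal base R sketch for inspecting which fruits matched, assuming the same data as in the question (the names matches and distinct_count are made up for illustration):

```r
# Same data as in the question
fruits_eat  <- c('XXappleYYY,lemon,orange,pitaya',
                 'Navel orange,Blood orange,watermelon,cherry',
                 'pitaya,cherry,banana')
fruits_list <- c('apple', 'lemon', 'orange', 'watermelon', 'peach', 'pear')

# One regex that matches any of the listed fruits
pattern <- paste(fruits_list, collapse = "|")

# Extract every match per person; a person with no match gets character(0)
matches <- regmatches(fruits_eat, gregexpr(pattern, fruits_eat))

# Number of matches per person (what str_count reports)
lengths(matches)

# Number of distinct listed fruits per person
distinct_count <- lengths(lapply(matches, unique))
distinct_count
```

If the number of different fruits is what matters, distinct_count may be the more useful column.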

CodePudding user response:

A solution using data.table and its fast if-else, fifelse(), with the base R function grepl() doing the matching. The "l" at the end of grepl() stands for logical: it returns TRUE if the pattern is matched anywhere in the given string (fruits_eat) and FALSE otherwise, so its result can be passed directly as the test argument of fifelse().

The key point is that pasting the strings "string1" and "string2" together with "|" as the separator gives the pattern "string1|string2", which grepl() treats as a match for either "string1" or "string2".

library(data.table)
setDT(df)

df[, fruits_in_list := fifelse(grepl(paste0(fruits_list, collapse = "|"),
                                     fruits_eat), 1, 0)]
df
      id                                  fruits_eat fruits_in_list
1:  Jack              XXappleYYY,lemon,orange,pitaya              1
2:  Rose Navel orange,Blood orange,watermelon,cherry              1
3: Biden                        pitaya,cherry,banana              0
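
The "|" idea can be checked on its own with base R only. The sketch below, assuming the question's data, builds the combined pattern and compares it with testing each fruit separately as a literal substring (fixed = TRUE). The names hits, any_regex and any_fixed are made up for illustration:

```r
# Same data as in the question
fruits_eat  <- c('XXappleYYY,lemon,orange,pitaya',
                 'Navel orange,Blood orange,watermelon,cherry',
                 'pitaya,cherry,banana')
fruits_list <- c('apple', 'lemon', 'orange', 'watermelon', 'peach', 'pear')

# Combined pattern: "apple|lemon|orange|watermelon|peach|pear"
pattern <- paste(fruits_list, collapse = "|")

# One regex call: TRUE if any listed fruit occurs anywhere in the string
any_regex <- grepl(pattern, fruits_eat)

# Equivalent check without regex: one literal substring test per fruit.
# hits is a logical matrix with one row per person and one column per fruit.
hits <- sapply(fruits_list, function(f) grepl(f, fruits_eat, fixed = TRUE))
any_fixed <- rowSums(hits) > 0

# Convert to the 1/0 indicator asked for in the question
as.integer(any_regex)
```

The per-fruit version is slower on large inputs, but it sidesteps regex metacharacters in the search terms and shows in hits which fruit matched for which person.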