Is there an R package (or existing function) for fuzzy string detection?

Time:06-23

I'm looking for something similar to str_detect() from the stringr package, but which is capable of detecting imperfect or "fuzzy" matches. Preferably, I'd like to be able to specify the degree of imperfection (1 different character, 2 different characters, etc.).

The matching I'm doing will take a form similar to the below code (but this is just a simplified example I made up). In the example, only "RUTH CHRIS" gets matched - I'd like something capable of matching the slightly wrong strings as well.

library(tidyverse)

my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                        "NEW JERSEY WENDYS",
                                        "8/25/19 RUTH CHRIS",
                                        "MELTINGPO 9823i3")
)

cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse="|")
expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse="|")

my_restaurants %>%
  mutate(category = case_when(
    str_detect(restaurant, cheap) ~ "CHEAP",
    str_detect(restaurant, expensive) ~ "EXPENSIVE"
    )) 

So again, this gives this output:

##  A tibble: 4 × 2
#   restaurant                 category 
#   <chr>                      <chr>    
# 1 MCDOlNALD'S ON FRANKLIN ST NA       
# 2 NEW JERSEY WENDYS          NA       
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           NA

But I want:

## A tibble: 4 × 2
#   restaurant                 category 
#   <chr>                      <chr>    
# 1 MCDOlNALD'S ON FRANKLIN ST CHEAP       
# 2 NEW JERSEY WENDYS          CHEAP       
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           EXPENSIVE

I'm not against using regex, but my actual data is significantly more complicated than the given example, so I'd prefer something much more concise that allows for general, not specific, types of fuzziness.

CodePudding user response:

In base R, you could do:

cheap <- c("MCDONALD'S", "WENDY'S") 
expensive <- c("RUTH CHRIS", "MELTING POT")

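# Long table of patterns: column `values` holds the name, `ind` the category
pat <- stack(list(cheap = cheap, expensive = expensive))

# agrep() returns the row index each pattern fuzzily matches, so this relies on
# every pattern matching exactly one row, in the same order as the rows
transform(my_restaurants, category = pat[sapply(pat$values, agrep, restaurant), 2])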

                  restaurant  category
1 MCDOlNALD'S ON FRANKLIN ST     cheap
2          NEW JERSEY WENDYS     cheap
3         8/25/19 RUTH CHRIS expensive
4           MELTINGPO 9823i3 expensive
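
Since the question asks for something shaped like str_detect() with a tunable amount of fuzziness, here is a minimal sketch built on base R's agrepl(). The name fuzzy_detect and its max_edits argument are made up for illustration; the only real API used is agrepl(), whose max.distance is treated as an absolute number of edits when it is 1 or more. Unlike the one-liner above, it does not depend on the order of the rows.

```r
# Hypothetical helper (not part of stringr): TRUE for each element of x in
# which at least one pattern occurs with at most max_edits insertions,
# deletions or substitutions.
fuzzy_detect <- function(x, patterns, max_edits = 1L) {
  hits <- vapply(
    patterns,
    function(p) agrepl(p, x, max.distance = max_edits),
    logical(length(x))
  )
  # vapply() drops to a plain vector when length(x) == 1, so rebuild the matrix
  rowSums(matrix(hits, nrow = length(x))) > 0
}

restaurants <- c("MCDOlNALD'S ON FRANKLIN ST", "NEW JERSEY WENDYS",
                 "8/25/19 RUTH CHRIS", "MELTINGPO 9823i3")
cheap     <- c("MCDONALD'S", "WENDY'S")
expensive <- c("RUTH CHRIS", "MELTING POT")

# One edit is enough for the cheap names; "MELTINGPO" needs two
# (the missing space and the missing "T").
category <- ifelse(fuzzy_detect(restaurants, cheap, 1L), "CHEAP",
            ifelse(fuzzy_detect(restaurants, expensive, 2L), "EXPENSIVE",
                   NA_character_))
```

Inside the question's pipeline the same helper should slot into case_when() in place of str_detect(), e.g. fuzzy_detect(restaurant, cheap) ~ "CHEAP". Raising max_edits makes false positives more likely on short patterns, so it is probably worth keeping it small.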

CodePudding user response:

You can use fuzzyjoin::stringdist_left_join:

cheap <- c("MCDONALD'S", "WENDY'S") 
expensive <- c("RUTH CHRIS", "MELTING POT")

pat <- stack(list(cheap = cheap, expensive = expensive))

fuzzyjoin::stringdist_left_join(my_restaurants, pat,
      by = c(restaurant = 'values'), max_dist = 0.45, method = 'jaccard')

# A tibble: 4 x 3
  restaurant                 values      ind      
  <chr>                      <chr>       <fct>    
1 MCDOlNALD'S ON FRANKLIN ST MCDONALD'S  cheap    
2 NEW JERSEY WENDYS          WENDY'S     cheap    
3 8/25/19 RUTH CHRIS         RUTH CHRIS  expensive
4 MELTINGPO 9823i3           MELTING POT expensive
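
To see where a threshold like max_dist = 0.45 comes from, it may help to compute the Jaccard distance by hand. The sketch below assumes the join uses single characters as q-grams (q = 1, which I believe is the stringdist default); jaccard_chars is a hypothetical name, not a function from fuzzyjoin or stringdist.

```r
# Hypothetical re-implementation of a Jaccard distance on single characters:
# 1 - |shared characters| / |all distinct characters in either string|
jaccard_chars <- function(a, b) {
  sa <- unique(strsplit(a, "")[[1]])
  sb <- unique(strsplit(b, "")[[1]])
  1 - length(intersect(sa, sb)) / length(union(sa, sb))
}

# 9 shared characters out of 16 distinct ones: 1 - 9/16
d <- jaccard_chars("MCDONALD'S", "MCDOlNALD'S ON FRANKLIN ST")
d  # 0.4375, just under the 0.45 cutoff
```

Note that this measure ignores character order and repetition, and it grows with the amount of extra text around the name, so the cutoff has to be tuned to how much noise surrounds the names in the real data.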