I'm looking for something similar to str_detect() from the stringr package, but which is capable of detecting imperfect or "fuzzy" matches. Preferably, I'd like to be able to specify the degree of imperfection (1 different character, 2 different characters, etc.).
The matching I'm doing will take a form similar to the below code (but this is just a simplified example I made up). In the example, only "RUTH CHRIS" gets matched - I'd like something capable of matching the slightly wrong strings as well.
library(tidyverse)
my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                        "NEW JERSEY WENDYS",
                                        "8/25/19 RUTH CHRIS",
                                        "MELTINGPO 9823i3")
)
cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse = "|")
expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse = "|")
my_restaurants %>%
  mutate(category = case_when(
    str_detect(restaurant, cheap) ~ "CHEAP",
    str_detect(restaurant, expensive) ~ "EXPENSIVE"
  ))
This gives the following output:
## A tibble: 4 × 2
#   restaurant                 category
#   <chr>                      <chr>
# 1 MCDOlNALD'S ON FRANKLIN ST NA
# 2 NEW JERSEY WENDYS          NA
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           NA
But I want:
## A tibble: 4 × 2
#   restaurant                 category
#   <chr>                      <chr>
# 1 MCDOlNALD'S ON FRANKLIN ST CHEAP
# 2 NEW JERSEY WENDYS          CHEAP
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           EXPENSIVE
I'm not against using regex, but my actual data is significantly more complicated than the given example, so I'd prefer something much more concise that allows for general, not specific, types of fuzziness.
CodePudding user response:
In base R, you could do:
cheap <- c("MCDONALD'S", "WENDY'S")
expensive <- c("RUTH CHRIS", "MELTING POT")
pat <- stack(list(cheap = cheap, expensive = expensive))
transform(my_restaurants, category = pat[sapply(pat$values, agrep, restaurant), 2])
                  restaurant  category
1 MCDOlNALD'S ON FRANKLIN ST     cheap
2          NEW JERSEY WENDYS     cheap
3         8/25/19 RUTH CHRIS expensive
4           MELTINGPO 9823i3 expensive
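One caveat, as far as I can tell: the one-liner above uses the row positions returned by agrep() to index pat, so it lines up only when each pattern matches exactly one row and the rows come in the same order as the patterns, as they do in the example data. A more defensive sketch is below. It uses agrepl(), which returns one TRUE/FALSE per row, and passes an integer max.distance so the number of allowed edits can be set directly. fuzzy_detect() and the max_edits argument are hypothetical names made up for this illustration, and the data is copied from the question.

```r
# Input data copied from the question.
restaurant <- c("MCDOlNALD'S ON FRANKLIN ST",
                "NEW JERSEY WENDYS",
                "8/25/19 RUTH CHRIS",
                "MELTINGPO 9823i3")
patterns <- list(CHEAP     = c("MCDONALD'S", "WENDY'S"),
                 EXPENSIVE = c("RUTH CHRIS", "MELTING POT"))

# Hypothetical helper: TRUE for each element of x that contains any of the
# patterns within `max_edits` insertions, deletions or substitutions.
fuzzy_detect <- function(x, pats, max_edits = 2L) {
  hits <- vapply(pats,
                 function(p) agrepl(p, x, max.distance = max_edits),
                 logical(length(x)))
  # One row per element of x, one column per pattern.
  rowSums(matrix(hits, nrow = length(x))) > 0
}

# Same precedence as the case_when() in the question: CHEAP is checked first.
category <- ifelse(fuzzy_detect(restaurant, patterns$CHEAP), "CHEAP",
            ifelse(fuzzy_detect(restaurant, patterns$EXPENSIVE), "EXPENSIVE",
                   NA_character_))

data.frame(restaurant, category)
```

Lowering max_edits to 1 makes the match stricter; with the example data, "MELTINGPO" is two deletions away from "MELTING POT", so it would then stay NA.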
CodePudding user response:
You can use fuzzyjoin::stringdist_left_join():
cheap <- c("MCDONALD'S", "WENDY'S")
expensive <- c("RUTH CHRIS", "MELTING POT")
pat <- stack(list(cheap = cheap, expensive = expensive))
fuzzyjoin::stringdist_left_join(my_restaurants, pat,
                                c(restaurant = 'values'),
                                max_dist = 0.45, method = 'jaccard')
# A tibble: 4 x 3
  restaurant                 values      ind
  <chr>                      <chr>       <fct>
1 MCDOlNALD'S ON FRANKLIN ST MCDONALD'S  cheap
2 NEW JERSEY WENDYS          WENDY'S     cheap
3 8/25/19 RUTH CHRIS         RUTH CHRIS  expensive
4 MELTINGPO 9823i3           MELTING POT expensive
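To see where a threshold like max_dist = 0.45 comes from, it may help to compute the distance by hand. The sketch below assumes the join falls through to stringdist's default q = 1, in which case the Jaccard distance compares the sets of distinct characters in the two strings. jaccard_chars() is a hypothetical helper written for this illustration; it is not part of fuzzyjoin or stringdist.

```r
# Hypothetical helper: Jaccard distance over the sets of distinct characters,
# i.e. 1 - |A intersect B| / |A union B|.
jaccard_chars <- function(a, b) {
  A <- unique(strsplit(a, "")[[1]])
  B <- unique(strsplit(b, "")[[1]])
  1 - length(intersect(A, B)) / length(union(A, B))
}

# 9 shared characters out of 16 distinct ones: 1 - 9/16 = 0.4375
jaccard_chars("MCDOlNALD'S ON FRANKLIN ST", "MCDONALD'S")

# Unrelated pair from the same data, for comparison.
jaccard_chars("MCDOlNALD'S ON FRANKLIN ST", "RUTH CHRIS")
```

Note that this kind of distance ignores character order and scales with the amount of extra text in the row, so it is not a count of "different characters". The threshold usually needs tuning against the real data.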