I want to take a list of Customer names and compare them to an internal database to find a high likely match and return a customer code
So I would receive a list of customers like this:
Cx Name |
---|
Chicken C. |
Water Gmbh |
Computer ldt |
Food, Glorious Food |
and I want to compare it to an internal database like this:
Cx Name database | Cx Number |
---|---|
Tech Co. | 9123 |
Computer LTD. | 8123 |
Chicken Co. | 7123 |
Water Gmbh | 6123 |
and return something like this:
Cx Name | Cx Suggestion |
---|---|
Chicken C. | 7123 |
Water Gmbh | 6123 |
Computer ldt | 8123 |
I was thinking of using a loop and stringdist to compare each cx name to the database and return the highest value score if it scores above a 90% match. But I'm not sure how to best approach this and my loop skills are bit rusty in R.
This is obviously a very crude example. Typically I would do a bit of data cleaning before hand and I would be working with about 500 different customers matched against a database of 5000 - 10000 customers names.
CodePudding user response:
You could try something like this:
library(tidyverse)
library(stringdist)
customers |>
mutate(dists = map(`Cx Name`, ~stringdist(tolower(.x), tolower(database_names$`Cx Name database`), method = "jaccard", q = 2)),
max_val = map_dbl(dists, min),
`Cx Suggestion` = map2_dbl(dists, max_val, ~ifelse(.y < 0.4, database_names$`Cx Number`[which.min(.x)], NA_real_))) |>
select(`Cx Name`, `Cx Suggestion`)
#> # A tibble: 4 x 2
#> `Cx Name` `Cx Suggestion`
#> <chr> <dbl>
#> 1 Chicken C. 7123
#> 2 Water Gmbh 6123
#> 3 Computer ldt 8123
#> 4 Food, Glorious Food NA
You will need to play around with the threshold that would work for you. I set it to 0.4 here, but you could go lower if things need to fit better. I also recomeng looking into fuzzy_join
.
Data:
customers <- tribble(~`Cx Name`,
"Chicken C.",
"Water Gmbh",
"Computer ldt",
"Food, Glorious Food")
database_names <- tribble(~`Cx Name database`, ~`Cx Number`,
"Tech Co.", 9123,
"Computer LTD.", 8123,
"Chicken Co.", 7123,
"Water Gmbh", 6123)