Compare one column in database x to another column in database y and return a database z containing-CodePudding

I want to take a list of Customer names and compare them to an internal database to find a high likely match and return a customer code

So I would receive a list of customers like this:

Cx Name
Chicken C.
Water Gmbh
Computer ldt
Food, Glorious Food

and I want to compare it to an internal database like this:

Cx Name database	Cx Number
Tech Co.	9123
Computer LTD.	8123
Chicken Co.	7123
Water Gmbh	6123

and return something like this:

Cx Name	Cx Suggestion
Chicken C.	7123
Water Gmbh	6123
Computer ldt	8123

I was thinking of using a loop and stringdist to compare each cx name to the database and return the highest value score if it scores above a 90% match. But I'm not sure how to best approach this and my loop skills are bit rusty in R.

This is obviously a very crude example. Typically I would do a bit of data cleaning before hand and I would be working with about 500 different customers matched against a database of 5000 - 10000 customers names.

CodePudding user response：

You could try something like this:

library(tidyverse)
library(stringdist)

customers |>
  mutate(dists = map(`Cx Name`, ~stringdist(tolower(.x), tolower(database_names$`Cx Name database`), method = "jaccard", q = 2)),
         max_val = map_dbl(dists, min),
         `Cx Suggestion` = map2_dbl(dists, max_val, ~ifelse(.y < 0.4, database_names$`Cx Number`[which.min(.x)], NA_real_))) |>
  select(`Cx Name`, `Cx Suggestion`)
#> # A tibble: 4 x 2
#>   `Cx Name`           `Cx Suggestion`
#>   <chr>                         <dbl>
#> 1 Chicken C.                     7123
#> 2 Water Gmbh                     6123
#> 3 Computer ldt                   8123
#> 4 Food, Glorious Food              NA

You will need to play around with the threshold that would work for you. I set it to 0.4 here, but you could go lower if things need to fit better. I also recomeng looking into fuzzy_join.

Data:

customers <- tribble(~`Cx Name`,
"Chicken C.",
"Water Gmbh",
"Computer ldt",
"Food, Glorious Food")


database_names <- tribble(~`Cx Name database`,  ~`Cx Number`,
"Tech Co.", 9123,
"Computer LTD.",    8123,
"Chicken Co.",  7123,
"Water Gmbh",   6123)