If I have a data frame in R with two UK postcode fields (both in upper case), is there an easy, convenient way to define a score that is based on some kind of fuzzy comparison between these two fields?
I have done some googling, but everything I found was some kind of "fuzzy join", and I don't need the join part here.
So for example if I had:
my_df <- data.frame(postcode_1 = c("AB1 1AB", "DN17 2DF", "TN38 8LE", "FK1 2ZZ"),
postcode_2 = c("AB1 1AB", "EC1X 3WW", "TN38 9LE", "FK2 1ZZ"))
then I might like to do something like
my_df <- my_df |>
mutate(score = fuzzy_string_compare_thingy(postcode_1, postcode_2))
to give me (for example):
my_df <- data.frame(postcode_1 = c("AB1 1AB", "DN17 2DF", "TN38 8LE", "FK1 2ZZ"),
postcode_2 = c("AB1 1AB", "EC1X 3WW", "TN38 9LE", "FK2 1ZZ"),
score = c(1, 0.1, 0.9, 0.7))
(the values in the score field are made up, of course)
CodePudding user response:
Choose the distance method you want from the stringdist package. To compare what the available methods return for your data, compute them all side by side:
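library(stringdist)
# Distance methods to compare side by side
methods <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
             "cosine", "jaccard", "jw", "soundex")
# stringdist() is vectorised over both inputs, so no row-wise apply() is needed
cbind(my_df,
      sapply(methods, function(m)
        stringdist(my_df$postcode_1, my_df$postcode_2, method = m)))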
postcode_1 postcode_2 osa lv dl hamming lcs qgram cosine jaccard
1 AB1 1AB AB1 1AB 0 0 0 0 0 0 0.0000000 0.0000000
2 DN17 2DF EC1X 3WW 6 6 6 6 12 12 0.8000000 0.8333333
3 TN38 8LE TN38 9LE 1 1 1 1 2 2 0.1055728 0.1250000
4 FK1 2ZZ FK2 1ZZ 2 2 2 2 4 0 0.0000000 0.0000000
jw soundex
1 0.00000000 0
2 0.50000000 1
3 0.08333333 0
4 0.04761905 0
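Note that row 4 gets a distance of 0 from "qgram", "cosine" and "jaccard" even though the postcodes differ. With the default q = 1 these methods only compare character counts, and "FK1 2ZZ" and "FK2 1ZZ" contain exactly the same characters. A minimal sketch, assuming you want these methods to be sensitive to character order, is to raise q (the choice of q = 2 here is only an illustration):

```r
library(stringdist)

a <- "FK1 2ZZ"
b <- "FK2 1ZZ"

# q = 1 (the default): only single-character counts are compared,
# so swapping two characters goes unnoticed
d_q1 <- stringdist(a, b, method = "qgram", q = 1)

# q = 2: overlapping character pairs are compared, so order matters
d_q2 <- stringdist(a, b, method = "qgram", q = 2)

print(c(q1 = d_q1, q2 = d_q2))
```

The same q argument applies to method = "cosine" and method = "jaccard".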
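For example, "jw" (the Jaro distance under stringdist's default settings) already lies between 0 and 1, so subtracting it from 1 gives a score where 1 means the two postcodes are identical. With dplyr: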
library(dplyr)
my_df %>%
mutate(score_1 = 1 - stringdist(postcode_1, postcode_2, method="jw"))
postcode_1 postcode_2 score_1
1 AB1 1AB AB1 1AB 1.0000000
2 DN17 2DF EC1X 3WW 0.5000000
3 TN38 8LE TN38 9LE 0.9166667
4 FK1 2ZZ FK2 1ZZ 0.9523810
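As an alternative to computing 1 - distance by hand, the stringdist package also provides stringsim(), which returns a similarity between 0 and 1 directly. The sketch below rebuilds the example data so it runs on its own; the column names score_jw and score_lv are made up for illustration, and which method suits postcodes best is a judgment call rather than something this snippet settles.

```r
library(stringdist)

my_df <- data.frame(
  postcode_1 = c("AB1 1AB", "DN17 2DF", "TN38 8LE", "FK1 2ZZ"),
  postcode_2 = c("AB1 1AB", "EC1X 3WW", "TN38 9LE", "FK2 1ZZ")
)

# stringsim() returns a similarity in [0, 1]; 1 means identical strings.
# For "jw" this is 1 minus the distance computed by stringdist().
my_df$score_jw <- stringsim(my_df$postcode_1, my_df$postcode_2, method = "jw")

# Edit-based distances such as "lv" are rescaled by stringsim(),
# so the result is also comparable across string lengths.
my_df$score_lv <- stringsim(my_df$postcode_1, my_df$postcode_2, method = "lv")

print(my_df)
```

Because stringsim() is vectorised in the same way as stringdist(), it can also be used inside mutate() exactly as in the dplyr example above.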