I asked this question before but realised it didn't account for typos or misformatting in my data.
In my data there are two columns with strings. If there is only one letter difference (either an additional letter or a mistyped letter in the word) between them then they are probably the same and should match. If a word has the same letters but in a different order, or if there is more than one different/additional character, it shouldn't. Something like this:
library(dplyr)  # provides tibble(), mutate() and %>%

Misspellings <- tibble(
  Name1 = c("Location", "Tree", "Street", "Place", "Racecar"),
  Name2 = c("Locatione", "Treeee", "Steept", "Pluce", "Carrace"),
  Match = c("TRUE", "FALSE", "FALSE", "TRUE", "FALSE"))
My crude, cobbled-together attempt at a solution isn't sensitive enough and doesn't give the results I need. I assume there is a more elegant way of solving my issue than this:
Misspellings %>%
  mutate(RMatch = sapply(1:nrow(Misspellings), function(i)
    agrepl(Misspellings$Name1[i], Misspellings$Name2[i], max.distance = 1)))
CodePudding user response:
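You can use adist, which computes the edit (Levenshtein) distance between every pair of strings, together with diag to keep only the row-wise pairs: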
diag(adist(Misspellings$Name1, Misspellings$Name2))
#[1] 1 2 2 1 6
Misspellings %>%
  mutate(RMatch = diag(adist(Name1, Name2)) <= 1)
  Name1    Name2     Match RMatch
  <chr>    <chr>     <chr> <lgl>
1 Location Locatione TRUE  TRUE
2 Tree     Treeee    FALSE FALSE
3 Street   Steept    FALSE FALSE
4 Place    Pluce     TRUE  TRUE
5 Racecar  Carrace   FALSE FALSE
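One thing to keep in mind: adist(Name1, Name2) builds the full n-by-n distance matrix before diag throws most of it away, which can get slow or memory-hungry on long columns. A minimal sketch of a row-by-row variant in base R is below. The helper name pairwise_adist is made up for illustration and is not part of any package, and I am assuming both columns are character vectors of the same length.

```r
# Hypothetical helper (not from any package): edit distance for each row pair only,
# so no n x n matrix is ever built.
pairwise_adist <- function(x, y) {
  stopifnot(length(x) == length(y))
  # adist() on two single strings returns a 1 x 1 matrix; [1, 1] extracts the value.
  # unname() drops the names that mapply() copies over from x.
  unname(mapply(function(a, b) adist(a, b)[1, 1], x, y))
}

name1 <- c("Location", "Tree", "Street", "Place", "Racecar")
name2 <- c("Locatione", "Treeee", "Steept", "Pluce", "Carrace")

d <- pairwise_adist(name1, name2)  # same values as diag(adist(name1, name2))
rmatch <- d <= 1                   # TRUE where the strings differ by at most one edit
```

Inside the pipeline this would be used as mutate(RMatch = pairwise_adist(Name1, Name2) <= 1). For a handful of rows the diag version is simpler and perfectly fine.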