Identify typos between two columns

Time:08-16

I asked this question but realised it didn't account for any typos/misformatting in my data.

In my data there are two columns with strings. If there is only one letter difference (either an additional letter or a mistyped letter in the word) between them then they are probably the same and should match. If a word has the same letters but in a different order, or if there is more than one different/additional character, it shouldn't. Something like this:

library(dplyr)  # provides %>% and mutate(), and re-exports tibble()

Misspellings <- tibble(
      Name1 = c("Location","Tree","Street","Place","Racecar"),
      Name2 = c("Locatione","Treeee","Steept","Pluce","Carrace"),
      Match = c("TRUE", "FALSE", "FALSE","TRUE", "FALSE"))

My crude, cobbled-together attempt at a solution isn't sensitive enough and doesn't give the results I need. I assume there is a more elegant way of solving my issue than this:

Misspellings %>%
      mutate(RMatch = sapply(1:nrow(Misspellings), function(i)
            agrepl(Misspellings$Name1[i], Misspellings$Name2[i], max.distance = 1)))

CodePudding user response:

You can use adist with diag:

diag(adist(Misspellings$Name1, Misspellings$Name2))
#[1] 1 2 2 1 6
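
In case it helps to see why diag is needed: adist compares every element of its first argument with every element of its second and returns the full matrix of edit distances, so the row-wise pairs sit on the diagonal. A small sketch using two of the pairs from the question (the names a, b and d are just placeholders):

```r
# Base R only: adist() is in the utils package, which is attached by default.
a <- c("Location", "Place")
b <- c("Locatione", "Pluce")

# d[i, j] is the edit distance between a[i] and b[j], so d is a 2 x 2 matrix.
d <- adist(a, b)
print(d)

# The diagonal keeps only a[1] vs b[1] and a[2] vs b[2], i.e. the row-wise pairs.
print(diag(d))
```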

Misspellings %>%
  mutate(RMatch = diag(adist(Name1, Name2)) <= 1)
  Name1    Name2     Match RMatch
  <chr>    <chr>     <chr> <lgl> 
1 Location Locatione TRUE  TRUE  
2 Tree     Treeee    FALSE FALSE 
3 Street   Steept    FALSE FALSE 
4 Place    Pluce     TRUE  TRUE  
5 Racecar  Carrace   FALSE FALSE
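
One caveat with the approach above is that adist builds the whole n x n matrix just to read its diagonal, which may get slow or memory-hungry for long columns. A possible alternative, sketched here under that assumption, is to compute the distance pair by pair with mapply. The helper name pair_dist is made up for this example and is not part of any package:

```r
# Base R only: adist() is in the utils package, which is attached by default.
name1 <- c("Location", "Tree", "Street", "Place", "Racecar")
name2 <- c("Locatione", "Treeee", "Steept", "Pluce", "Carrace")

# pair_dist (hypothetical helper): edit distance between x[i] and y[i] only,
# so no n x n matrix is ever created.
pair_dist <- function(x, y) {
  # adist() on two single strings returns a 1 x 1 matrix; drop() makes it a scalar.
  unname(mapply(function(a, b) drop(adist(a, b)), x, y))
}

# Treat a pair as a match when at most one edit separates the two strings.
rmatch <- pair_dist(name1, name2) <= 1
print(rmatch)
```

Inside the pipeline this would be used as mutate(RMatch = pair_dist(Name1, Name2) <= 1). Note that adist is case-sensitive by default; it has an ignore.case argument if that matters for your data.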