finding and matching next address, but want to drop if string is too close-CodePudding

I have a somewhat messy address database that track moves by a given order in long format. I want to add columns to match it to the next address, but I want to skip it / drop the entry if the next address is too close.

The process I have so far mirrors this one:

ID = c("A", "A", "A", "A", "B", "B", "B")
Set = c(1, 2, 3, 4, 1, 2, 3)
Street = c("123 Mulburry St", "234 Willow St", "235 Willow St", "492 Basin st", "231 Farrow st", "492 Little st", "231 Big st")

df = data.frame(ID, Set, Street)

df <- df %>% mutate(set_id = paste0(ID,  "_", Set))
df <- df %>% mutate(next_id = paste0(ID,  "_", (Set   1)))

match <- subset(df, select = c(set_id, Street))
match <- match %>% rename(next_street = Street)

df <- left_join(df, match, by = c("next_id" = "set_id"))
df$problem <- stringdist(df$Street, df$next_street) < 2

However, I would want to drop A3, where the set goes from 234 Willow St to 235 Willow St, assuming that this is an error / typo (basically, drop what is flagged in df$problem). But I can't really figure how I would drop that and replace it with the correct line (which, would be from A_2 to A_4).

CodePudding user response：

This looks at the next address and drops it if it is close. It uses agrepl which can also be fine tuned with cost and max.distance

library(dplyr)

df %>% 
  group_by(ID) %>% 
  mutate(Street2=lag(Street)) %>% 
  rowwise() %>% 
  mutate(Street=ifelse(agrepl(Street, Street2)==T,NA,Street), Street2=NULL) %>% 
  ungroup()
# A tibble: 7 × 3
  ID      Set Street         
  <chr> <dbl> <chr>          
1 A         1 123 Mulburry St
2 A         2 234 Willow St  
3 A         3 NA             
4 A         4 492 Basin st   
5 B         1 231 Farrow st  
6 B         2 492 Little st  
7 B         3 231 Big st