I have a somewhat messy address database that track moves by a given order in long format. I want to add columns to match it to the next address, but I want to skip it / drop the entry if the next address is too close.
The process I have so far mirrors this one:
ID = c("A", "A", "A", "A", "B", "B", "B")
Set = c(1, 2, 3, 4, 1, 2, 3)
Street = c("123 Mulburry St", "234 Willow St", "235 Willow St", "492 Basin st", "231 Farrow st", "492 Little st", "231 Big st")
df = data.frame(ID, Set, Street)
df <- df %>% mutate(set_id = paste0(ID, "_", Set))
df <- df %>% mutate(next_id = paste0(ID, "_", (Set 1)))
match <- subset(df, select = c(set_id, Street))
match <- match %>% rename(next_street = Street)
df <- left_join(df, match, by = c("next_id" = "set_id"))
df$problem <- stringdist(df$Street, df$next_street) < 2
However, I would want to drop A3, where the set goes from 234 Willow St to 235 Willow St, assuming that this is an error / typo (basically, drop what is flagged in df$problem). But I can't really figure how I would drop that and replace it with the correct line (which, would be from A_2 to A_4).
CodePudding user response:
This looks at the next address and drops it if it is close. It uses agrepl
which can also be fine tuned with cost
and max.distance
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Street2=lag(Street)) %>%
rowwise() %>%
mutate(Street=ifelse(agrepl(Street, Street2)==T,NA,Street), Street2=NULL) %>%
ungroup()
# A tibble: 7 × 3
ID Set Street
<chr> <dbl> <chr>
1 A 1 123 Mulburry St
2 A 2 234 Willow St
3 A 3 NA
4 A 4 492 Basin st
5 B 1 231 Farrow st
6 B 2 492 Little st
7 B 3 231 Big st