I need to parse large data files and, for reasons unknown, the addresses are sometimes repeated, like this:
library(data.table)
d <- data.table(name = c("bill", "tom"), address = c("35 Valerie Avenue / 35 Valerie Avenue", "702 / 9 Paddock Street / 702 / 9 Paddock Street"))
I have figured out how to de-dupe the easy ones (e.g. "35 Valerie Avenue / 35 Valerie Avenue") with the following:
replace.dupe.addresses <- function(x) {
  rep_expr <- "^(.*)/(.*)$"
  idx <- grepl("/", x) & (trimws(sub(rep_expr, "\\2", x)) == trimws(sub(rep_expr, "\\1", x)))
  x[idx] <- trimws(sub(rep_expr, "\\1", x[idx]))
  x
}
d[,address := replace.dupe.addresses(address)]
But this doesn't work for addresses that themselves contain a '/', such as "702 / 9 Paddock Street". I tried this regex instead:
rep_expr <- "^(.*)[:alpha:][:space:]?/(.*)$"
but it doesn't work either. What regular expression would handle both of these repeated addresses?
CodePudding user response:
Please check the code below:
library(dplyr)
library(tidyr)
d %>% separate_rows(address, sep = '\\/') %>% mutate(address = trimws(address)) %>%
  group_by(name, address) %>% slice_head(n = 1) %>% group_by(name) %>%
  mutate(address = paste(address, collapse = '/')) %>% slice_head(n = 1)
Created on 2023-01-27 with reprex v2.0.2
# A tibble: 2 × 2
# Groups:   name [2]
  name  address
  <chr> <chr>
1 bill  35 Valerie Avenue
2 tom   702/9 Paddock Street
CodePudding user response:
See if this works for your dataset:
library(data.table)
# strsplit() is vectorised over rows, so no per-row grouping is needed
d[, .(name, address = vapply(strsplit(address, " / ", fixed = TRUE), function(x)
  paste(x[!duplicated(x)], collapse = " / "), character(1)))]
   name                address
1: bill      35 Valerie Avenue
2:  tom 702 / 9 Paddock Street
CodePudding user response:
Split on forward slash, then get unique and paste it back:
sapply(strsplit(d$address, " / ", fixed = TRUE),
function(i) paste(unique(i), collapse = " "))
# [1] "35 Valerie Avenue" "702 9 Paddock Street"
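For the regex the question actually asks about, a backreference may be enough. Below is a minimal sketch in base R, assuming each address is repeated exactly twice and the two copies are identical apart from whitespace around the separating slash. The helper name dedupe_repeated is made up for illustration. As a side note, the attempted pattern likely fails because [:alpha:] and [:space:] are only treated as character classes inside a bracket expression, i.e. [[:alpha:]] and [[:space:]].

```r
# Hypothetical helper: collapses "X / X" to "X" with a backreference.
dedupe_repeated <- function(x) {
  # ^(.+?)      lazily capture the first copy of the address
  # \\s*/\\s*   the separating slash, with optional surrounding whitespace
  # \\1$        the same text again, running to the end of the string
  # Strings that do not match this shape are returned unchanged.
  sub("^(.+?)\\s*/\\s*\\1$", "\\1", x, perl = TRUE)
}

addresses <- c("35 Valerie Avenue / 35 Valerie Avenue",
               "702 / 9 Paddock Street / 702 / 9 Paddock Street",
               "702 / 9 Paddock Street")
result <- dedupe_repeated(addresses)
print(result)
```

With the question's data.table this could be applied as d[, address := dedupe_repeated(address)]. Addresses repeated three or more times, or copies that differ in spelling or case, would not match and would be left as they are.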