Use regex to replace duplicate phrases

I need to parse large data files and for reasons unknown the addresses are sometimes repeated, like this:

library(data.table)
d <- data.table(name = c("bill", "tom"), address = c("35 Valerie Avenue / 35 Valerie Avenue", "702 / 9 Paddock Street / 702 / 9 Paddock Street"))

I have figured out how to de-dupe the easy ones (e.g. "35 Valerie Avenue / 35 Valerie Avenue") with the following:

replace.dupe.addresses<- function(x){
  rep_expr<- "^(.*)/(.*)$"
  idx<- grepl("/",x) & (trimws(sub(rep_expr, "\\2", x)) == trimws(sub(rep_expr, "\\1",x)))
  x[idx]<- trimws(sub(rep_expr, "\\1",x[idx]))
  x
}

d[,address := replace.dupe.addresses(address)]

But this doesn't work for addresses where the repeated phrase itself contains a '/', such as "702 / 9 Paddock Street / 702 / 9 Paddock Street". I tried this as my regex: rep_expr<- "^(.*)[:alpha:][:space:]?/(.*)$" but it doesn't work either. What regular expression would handle both kinds of repeated phrase?

CodePudding user response：

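Please check the code below. It uses dplyr and tidyr to split each address on '/', keep one copy of each piece per name, and paste the pieces back together: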

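library(dplyr)
library(tidyr)
# One row per '/'-separated piece, trimmed; keep one row per (name, piece); re-join per name
d %>% separate_rows(address, sep = '\\/') %>% mutate(address = trimws(address)) %>%
  group_by(name, address) %>% slice_head(n = 1) %>% group_by(name) %>%
  mutate(address = paste(address, collapse = '/')) %>% slice_head(n = 1)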

Created on 2023-01-27 with reprex v2.0.2

# A tibble: 2 × 2
# Groups:   name [2]
  name  address             
  <chr> <chr>               
1 bill  35 Valerie Avenue   
2 tom   702/9 Paddock Street

CodePudding user response：

See if this works for your dataset:

library(data.table)

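# Split each address on " / ", drop repeated pieces, and paste the rest back.
# sapply() returns a plain character column, so no per-row grouping is needed.
d[, .(name, address = sapply(strsplit(address, " / ", fixed = TRUE), function(x)
  paste(x[!duplicated(x)], collapse = " / ")))]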
   name                address
1: bill      35 Valerie Avenue
2:  tom 702 / 9 Paddock Street
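
The split above assumes the separator is always exactly " / ". If the spacing around the slash varies in the real files (an assumption, since the question only shows one form), a minimal base R sketch could split on a slash with optional surrounding whitespace instead. The helper name `dedupe_pieces` and the second sample address are hypothetical, made up for illustration.

```r
# Hypothetical helper: drop repeated slash-separated pieces, tolerating
# inconsistent whitespace around "/". Pieces are re-joined with " / ".
dedupe_pieces <- function(x) {
  pieces <- strsplit(trimws(x), "\\s*/\\s*", perl = TRUE)
  out <- vapply(pieces, function(p) paste(unique(p), collapse = " / "),
                character(1))
  out[is.na(x)] <- NA_character_  # keep missing addresses missing
  out
}

# Made-up input with uneven spacing around the slashes
addresses <- c("35 Valerie Avenue / 35 Valerie Avenue",
               "702 / 9 Paddock Street/702 /9 Paddock Street")
dedupe_pieces(addresses)
```

With data.table this could be applied by reference as d[, address := dedupe_pieces(address)]. Note that it normalises every separator to " / ", which may or may not be wanted.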

CodePudding user response：

Split on the " / " separator, keep the unique pieces, and paste them back together:

sapply(strsplit(d$address, " / ", fixed = TRUE), 
       function(i) paste(unique(i), collapse = " / "))
# [1] "35 Valerie Avenue"      "702 / 9 Paddock Street"
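
Since the question asked for a regular expression, here is a sketch of one possible pattern, not a tested solution for the full data set. It uses a backreference (`\\1`) so that the text after the middle slash must be an exact repeat of the text before it. The lazy group `(.+?)` grows until the remainder of the string is a slash followed by the same phrase. The helper name `collapse_repeat` is hypothetical, and `perl = TRUE` is needed for the lazy quantifier.

```r
# Hypothetical helper: collapse "phrase / phrase" to "phrase" when the whole
# string is one phrase repeated exactly twice; other strings are returned as-is.
collapse_repeat <- function(x) {
  sub("^\\s*(.+?)\\s*/\\s*\\1\\s*$", "\\1", x, perl = TRUE)
}

x <- c("35 Valerie Avenue / 35 Valerie Avenue",
       "702 / 9 Paddock Street / 702 / 9 Paddock Street",
       "12 / 4 High Street")  # made-up non-duplicated address
collapse_repeat(x)
```

Unlike the split-and-unique approaches, this only changes a string when the entire phrase is duplicated, so an address that legitimately contains a '/' is left untouched. It will not catch a phrase repeated three or more times, or repeats that differ in spelling or internal spacing.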