How to delete repeating characters with stringr

I have a dataframe that looks like this:

library(tibble)
tibble(address_bad = c("1 DA DATALPA PL", 
                       "231 ST STERN AVE", 
                       "943 W WESTON AVE"))
#> # A tibble: 3 × 1
#>   address_bad     
#>   <chr>           
#> 1 1 DA DATALPA PL 
#> 2 231 ST STERN AVE
#> 3 943 W WESTON AVE

Some (but not all) addresses repeat the first two letters of the street name as a separate word before the street name (e.g., "1 DA DATALPA PL"). In that case, I want to change it to "1 DATALPA PL". Other addresses have a single character indicating the direction (north, south, east, west), which I want to keep (e.g., "943 W WESTON AVE"). How can I use stringr functions to get the following result?

tibble(address_good = c("1 DATALPA PL",
                        "231 STERN AVE", 
                        "943 W WESTON AVE"))
#> # A tibble: 3 × 1
#>   address_good    
#>   <chr>           
#> 1 1 DATALPA PL    
#> 2 231 STERN AVE   
#> 3 943 W WESTON AVE

CodePudding user response:

We could capture two uppercase letters ([A-Z]{2}), match an optional space followed by a backreference to the captured group (\\s?\\1), and replace the whole match with the captured group, so that only one copy of the two letters remains:

# df1 is assumed to hold the tibble from the question
df1$address_bad <- gsub("([A-Z]{2})(\\s?\\1)", "\\1", df1$address_bad)
df1$address_bad
#> [1] "1 DATALPA PL"     "231 STERN AVE"    "943 W WESTON AVE"
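
Since the question asks for stringr, the same idea can be sketched with str_replace_all(). This is only one possible variant, not the answer's original code: it assumes the duplicated prefix is always a separate two-letter word followed by whitespace, so it adds a word boundary (\\b) and makes the whitespace mandatory (\\s+). The names pattern and tricky, and the address "12 COCOA ST", are made up for illustration. The optional-space pattern above also collapses a repeat inside a single word, which the stricter pattern leaves alone.

```r
library(stringr)

address_bad <- c("1 DA DATALPA PL", "231 ST STERN AVE", "943 W WESTON AVE")

# \\b([A-Z]{2}) : two uppercase letters at the start of a word (group 1)
# \\s+          : at least one whitespace character, so group 1 is a whole word
# \\1           : the next word must start with the same two letters
pattern <- "\\b([A-Z]{2})\\s+\\1"

# Replace "XX XX" with a single "XX"
address_good <- str_replace_all(address_bad, pattern, "\\1")
address_good
#> [1] "1 DATALPA PL"     "231 STERN AVE"    "943 W WESTON AVE"

# Hypothetical address (not from the question) with a repeat inside one word
tricky <- "12 COCOA ST"
naive  <- gsub("([A-Z]{2})(\\s?\\1)", "\\1", tricky)  # optional-space pattern
strict <- str_replace_all(tricky, pattern, "\\1")     # stricter pattern
naive
#> [1] "12 COA ST"
strict
#> [1] "12 COCOA ST"
```

To write the result back into the tibble, assign it to the column in the same way as in the gsub() call, or use dplyr::mutate() if dplyr is already loaded.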