I have a dataframe that looks like this:
library(tibble)
tibble(address_bad = c("1 DA DATALPA PL",
"231 ST STERN AVE",
"943 W WESTON AVE"))
#> # A tibble: 3 × 1
#> address_bad
#> <chr>
#> 1 1 DA DATALPA PL
#> 2 231 ST STERN AVE
#> 3 943 W WESTON AVE
The distinguishing characteristic is that some (but not all) addresses have the first two letters of the street name as a separate word before the street name (e.g., "1 DA DATALPA PL"). In that case, I want to change it to "1 DATALPA PL". But sometimes there is a single character indicating the direction (north, south, east, west), which I want to keep (e.g., "943 W WESTON AVE"). How can I use stringr functions to get the following result?
tibble(address_good = c("1 DATALPA PL",
"231 STERN AVE",
"943 W WESTON AVE"))
#> # A tibble: 3 × 1
#> address_good
#> <chr>
#> 1 1 DATALPA PL
#> 2 231 STERN AVE
#> 3 943 W WESTON AVE
CodePudding user response:
We could capture the two uppercase letters (([A-Z]{2})), match an optional space followed by a backreference to the captured group (\\s?\\1), and replace the whole match with the backreference (\\1), so the duplicated prefix collapses into the start of the street name:
# df1 is assumed to hold the tibble from the question
df1$address_bad <- gsub("([A-Z]{2})(\\s?\\1)", "\\1", df1$address_bad)
df1$address_bad
[1] "1 DATALPA PL" "231 STERN AVE" "943 W WESTON AVE"
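Since the question asks for stringr, here is a minimal sketch of the same idea with str_replace(). It is a variation, not the pattern above: it assumes the stray prefix is always a standalone two-letter word, so it adds a word boundary (\\b) and requires at least one space (\\s+) instead of an optional one. Without those, a repeated pair inside a single word is collapsed too (the made-up address "12 COCOA ST" would become "12 COA ST"). The column name address_good and the extra test address are only for illustration.

```r
library(stringr)
library(tibble)

df1 <- tibble(address_bad = c("1 DA DATALPA PL",
                              "231 ST STERN AVE",
                              "943 W WESTON AVE"))

# \\b([A-Z]{2}) : a two-letter prefix starting at a word boundary (group 1)
# \\s+          : the whitespace separating it from the street name
# \\1           : the street name must start with the same two letters
pattern <- "\\b([A-Z]{2})\\s+\\1"

# Replace "XY XY..." with "XY...", i.e. drop the stray prefix and the space
df1$address_good <- str_replace(df1$address_bad, pattern, "\\1")
df1$address_good
#> [1] "1 DATALPA PL"     "231 STERN AVE"    "943 W WESTON AVE"

# Hypothetical address with a repeated pair inside one word: left untouched
str_replace("12 COCOA ST", pattern, "\\1")
#> [1] "12 COCOA ST"
```

One caveat that no regex can settle on its own: a genuine two-letter word that happens to match the next word's first letters (say, a "NE" direction before "NEWTON") would also be removed, so it is worth checking such rows separately.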