I am trying to clean addresses that come in formats such as 1804 E Osage Rd DERBY KS 670378863
or 55 Cabela Dr GARNER NC 27529.
As shown, the postal codes at the end of the addresses vary in length, and I would like to remove the trailing numeric portion of each address. In Excel I can use =LEFT(A2, LEN(A2)-x),
but that is not good enough, since x is fixed while the number of trailing digits varies from string to string.
How can I use R or a regex to remove all numeric characters from the right until a non-numeric character is reached?
Expected output:
raw_Address | clean_Address |
---|---|
1804 E Osage Rd DERBY KS 670378863 | 1804 E Osage Rd DERBY KS |
55 Cabela Dr GARNER NC 27529 | 55 Cabela Dr GARNER NC |
CodePudding user response:
We may use trimws
from base R
- the whitespace argument takes a regex, so match one or more whitespace characters followed by one or more digits; this removes the match at the right end of the string
df1$clean_Address <- trimws(df1$raw_Address, whitespace = "\\s+\\d+")
-output
> df1
raw_Address clean_Address
1 1804 E Osage Rd DERBY KS 670378863 1804 E Osage Rd DERBY KS
2 55 Cabela Dr GARNER NC 27529 55 Cabela Dr GARNER NC
data
df1 <- structure(list(raw_Address = c("1804 E Osage Rd DERBY KS 670378863",
"55 Cabela Dr GARNER NC 27529")), row.names = c(NA, -2L), class = "data.frame")
CodePudding user response:
Using {stringr}
raw_Address <- c("1804 E Osage Rd DERBY KS 670378863",
                 "55 Cabela Dr GARNER NC 27529")
library(stringr)
str_replace(raw_Address, "\\s\\d+$", "")
# or, more simply
str_remove(raw_Address, "\\s\\d+$")
#> [1] "1804 E Osage Rd DERBY KS" "55 Cabela Dr GARNER NC"
Created on 2022-03-18 by the reprex package (v2.0.1)
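As a side note, both answers rely on the same idea: whitespace, then a run of digits, anchored at the end of the string. A minimal base R sketch with sub(), assuming the postal code is always separated from the state by whitespace (the variable names are only for illustration):

```r
# Sample addresses taken from the question
raw_Address <- c("1804 E Osage Rd DERBY KS 670378863",
                 "55 Cabela Dr GARNER NC 27529")

# \\s+ : one or more whitespace characters
# \\d+ : one or more digits
# $    : end of the string, so the leading house number is left alone
clean_Address <- sub("\\s+\\d+$", "", raw_Address)
clean_Address
#> [1] "1804 E Osage Rd DERBY KS" "55 Cabela Dr GARNER NC"
```

A string with no trailing digits comes back unchanged, because sub() only replaces when the pattern matches.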