I'm trying to extract state abbreviations from a column of addresses in a dataframe that have varying formats. Example:
"123 Any St., Some City, IL 65234 United States"
"456 Any Other St That Town, CA 62626-1234 US"
I used this code that works for strings with 5-digit zip codes, but doesn't work for strings with 9-digit zip codes:
df$state <- str_extract(df$address, "\\b[A-Z]{2}(?=\\s+\\d{5}$)")
How do I change this so that it extracts states followed by both 5-digit and 9-digit zip codes?
CodePudding user response:
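When I run your code for 5-digit zip codes on the example strings, it doesn't work and returns NAs. The trailing $ anchors the zip code to the end of the string, but both example addresses end with a country name.
If we delete the last $, the lookahead only checks the first five digits after the state, so it works for both 5-digit and 9-digit zip codes: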
teststr <- c("123 Any St., Some City, IL 65234 United States",
"456 Any Other St That Town, CA 62626-1234 US")
stringr::str_extract(teststr, "\\b[A-Z]{2}(?=\\s+\\d{5})")
#> [1] "IL" "CA"
Created on 2021-11-02 by the reprex package (v2.0.1)
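If you would rather have the pattern spell out the optional ZIP+4 suffix instead of just checking the first five digits, a minimal sketch in base R could look like the following. The `(?:-\\d{4})?` part and the closing `\\b` are my additions, not part of the code above, and the sketch assumes the state is always two capital letters directly before the zip code.

```r
addresses <- c(
  "123 Any St., Some City, IL 65234 United States",
  "456 Any Other St That Town, CA 62626-1234 US"
)

# Two capital letters, followed (lookahead only) by whitespace, five digits,
# and an optional "-dddd" ZIP+4 suffix ending at a word boundary.
pattern <- "\\b[A-Z]{2}(?=\\s+\\d{5}(?:-\\d{4})?\\b)"

# perl = TRUE is needed for lookahead support in base R.
m <- regexpr(pattern, addresses, perl = TRUE)

# regexpr() returns -1 where nothing matched, so those rows stay NA.
states <- rep(NA_character_, length(addresses))
states[m > 0] <- regmatches(addresses, m)
states
```

The same pattern string should also work with stringr::str_extract, since the lookahead syntax is the same there.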
CodePudding user response:
The code below does not require the str_extract function:
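addresses <- c(
  "123 Any St., Some City, IL 65234 United States",
  "456 Any Other St That Town, CA 62626-1234 US")
# state.abb is R's built-in vector of two-letter US state abbreviations.
# Capture group 1 is the state that directly precedes the zip code.
states <- gsub(
  paste0(".*(", paste0(state.abb, collapse = "|"), ")",
         " \\d{5}(-\\d{4}){0,1}.*"),
  "\\1", addresses)
states
# [1] "IL" "CA"
# Same pattern, with the whole zip code (5 or 9 digits) as capture group 2.
zip_codes <- gsub(
  paste0(".*(", paste0(state.abb, collapse = "|"), ")",
         " (\\d{5}(-\\d{4}){0,1}).*"),
  "\\2", addresses)
zip_codes
# [1] "65234" "62626-1234"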
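One thing to watch with the gsub approach: when the pattern does not match, gsub returns the input string unchanged, not NA. If your column may contain addresses without a recognizable state and zip code, a hedged alternative is to use regexec and regmatches, which give an empty result for non-matching rows. This is only a sketch. The third address is a made-up non-matching row, and `pick` is a hypothetical helper name.

```r
addresses <- c(
  "123 Any St., Some City, IL 65234 United States",
  "456 Any Other St That Town, CA 62626-1234 US",
  "No state or zip here"  # hypothetical row with nothing to extract
)

# Group 1: a state abbreviation from state.abb; group 2: 5- or 9-digit zip.
pattern <- paste0("\\b(", paste(state.abb, collapse = "|"), ")",
                  " (\\d{5}(?:-\\d{4})?)")

# regmatches() on a regexec() result returns one character vector per input:
# full match first, then the capture groups; character(0) when no match.
hits <- regmatches(addresses, regexec(pattern, addresses, perl = TRUE))

# Hypothetical helper: take element i of each hit, or NA if there was no match.
pick <- function(i) {
  vapply(hits, function(h) if (length(h) > 0) h[i] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

states <- pick(2)
zip_codes <- pick(3)
```

With this version, rows that do not match end up as NA in both vectors, so they are easy to filter out afterwards.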