Extracting state abbreviation from string with 5 and 9 digit zip code in R

I'm trying to extract state abbreviations from a column of addresses in a dataframe that have varying formats. Example:

"123 Any St., Some City, IL 65234 United States"
"456 Any Other St That Town, CA 62626-1234 US"

I used this code that works for strings with 5-digit zip codes, but doesn't work for strings with 9-digit zip codes:

df$state <- str_extract(df$address, "\\b[A-Z]{2}(?=\\s+\\d{5}$)")

How do I change this so that it extracts states followed by both 5-digit and 9-digit zip codes?

CodePudding user response：

When I run your code for 5-digit zip codes on the example strings, it doesn't work and returns NAs.

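The trailing $ requires the zip code to sit at the very end of the string, but both examples end with a country name. If we delete the last $, it works for both 5-digit and 9-digit zip codes, because the lookahead then only has to see the first five digits: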

teststr <- c("123 Any St., Some City, IL 65234 United States",
             "456 Any Other St That Town, CA 62626-1234 US")

stringr::str_extract(teststr, "\\b[A-Z]{2}(?=\\s+\\d{5})")
#> [1] "IL" "CA"

Created on 2021-11-02 by the reprex package (v2.0.1)
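
If you would rather state the optional ZIP+4 part explicitly instead of relying on the first five digits, something like the following might work. It is only a sketch: it uses base R (regexpr with perl = TRUE) so it runs without extra packages, and the third address and the variable names are made up for illustration.

```r
addresses <- c(
  "123 Any St., Some City, IL 65234 United States",
  "456 Any Other St That Town, CA 62626-1234 US",
  "No state or zip here"   # hypothetical non-matching input
)

# Two capital letters, followed (lookahead only) by whitespace, five digits,
# and an optional "-dddd" suffix that must end at a word boundary.
pattern <- "\\b[A-Z]{2}(?=\\s+\\d{5}(?:-\\d{4})?\\b)"

m <- regexpr(pattern, addresses, perl = TRUE)

# regmatches() drops non-matches, so fill a vector of NAs by position.
state <- rep(NA_character_, length(addresses))
state[m > 0] <- regmatches(addresses, m)

state
# state is c("IL", "CA", NA)
```

The same pattern should also be usable with stringr::str_extract, which returns NA for non-matching strings on its own.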

CodePudding user response：

The code below uses only base R (gsub and the built-in state.abb vector), so it does not require the str_extract function:

addresses <- c(
  "123 Any St., Some City, IL 65234 United States",
  "456 Any Other St That Town, CA 62626-1234 US")

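# Build an alternation of the 50 state abbreviations ("AL|AK|...") and keep
# the one that precedes a 5-digit zip code with an optional "-dddd" suffix.
states <- gsub(
  paste0(".*(", paste0(state.abb, collapse = "|"), ")",
         " \\d{5}(-\\d{4}){0,1}.*"),
  "\\1", addresses)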

states
# [1] "IL" "CA"

# Same pattern, but the zip code gets its own group and is returned via "\\2".
zip_codes <- gsub(
  paste0(".*(", paste0(state.abb, collapse = "|"), ")",
         " (\\d{5}(-\\d{4}){0,1}).*"),
  "\\2", addresses)

zip_codes
# [1] "65234"      "62626-1234"
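
One caveat with the gsub approach: when the pattern does not match, gsub returns the input string unchanged, so an address without a recognizable state and zip comes back whole rather than as NA. A possible variant, sketched here with regexec and regmatches, pulls both pieces in one pass and yields NA for non-matches. The helper get_group and the third address are hypothetical, added only for illustration.

```r
addresses <- c(
  "123 Any St., Some City, IL 65234 United States",
  "456 Any Other St That Town, CA 62626-1234 US",
  "PO Box 9, Nowhere"   # hypothetical address with no state or zip
)

# Group 1: a state abbreviation from state.abb (50 states, no DC).
# Group 2: the zip code; group 3: the optional "-dddd" suffix.
pattern <- paste0(
  "\\b(", paste(state.abb, collapse = "|"), ")",
  "\\s+(\\d{5}(-\\d{4})?)\\b"
)

# One list element per address: c(full match, group 1, group 2, group 3),
# or character(0) when nothing matched.
parts <- regmatches(addresses, regexec(pattern, addresses, perl = TRUE))

# Hypothetical helper: return element i, or NA if the address did not match.
get_group <- function(p, i) if (length(p) >= i) p[i] else NA_character_

states    <- vapply(parts, get_group, character(1), i = 2)
zip_codes <- vapply(parts, get_group, character(1), i = 3)

# states is    c("IL", "CA", NA)
# zip_codes is c("65234", "62626-1234", NA)
```

Because the result is NA rather than the original string, rows that failed to parse are easy to find with is.na().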