How to write regular expression to parse address to extract unit number (R)?-CodePudding

I am trying to parse an address column that has the unit number built into it and extract the unit number to a new column.

df <- tibble(
  address = c(
  "UNIT 1 1234 FAKE STREET",
  "UNIT 10 1234 FAKE STREET",
  "UNIT 1000 1234 FAKE STREET",
  "UNIT A 1234 FAKE STREET",
  "UNIT A1 1234 FAKE STREET",
  "UNIT A1000 1234 FAKE STREET",
  "UNIT TH 1234 FAKE STREET",
  "UNIT TH1 1234 FAKE STREET",
  "UNIT TH1000 1234 FAKE STREET",
  "UNIT CRU 1234 FAKE STREET",
  "UNIT CRU 1 1234 FAKE STREET",
  "UNIT CRU 100 1234 FAKE STREET",
  "UNIT 1 1234 FAKE HIGHWAY 1",
  "1234 FAKE STREET",
  "1 1234 FAKE STREET",
  "1000 1234 FAKE STREET",
  "1 FAKE FAKE STREET",
  "10 FAKE FAKE STREET", 
  "1000 FAKE FAKE STREET",
  "FAKE STREET"),
  output_desire = c(
    "UNIT 1",
    "UNIT 10",
    "UNIT 1000",
    "UNIT A",
    "UNIT A1",
    "UNIT A1000",
    "UNIT TH",
    "UNIT TH1",
    "UNIT TH1000",
    "UNIT CRU",
    "UNIT CRU 1",
    "UNIT CRU 100",
    "UNIT 1",
    "NA",
    "1",
    "1000",
    "NA",
    "NA",
    "NA",
    "NA"))

What is the best way to write a regular expression to capture the unit numbers?

CodePudding user response：

You can use

df <- df %>%
   mutate(unit = trimws(str_extract(address, "UNIT.*?(?=\\s*\\d{4,})")))
#  A tibble: 11 x 2
#   address                     unit      
#   <chr>                       <chr>     
# 1 UNIT A 1234 FAKE STREET     UNIT A    
# 2 UNIT A1 1234 FAKE STREET    UNIT A1   
# 3 UNIT 1 1234 FAKE STREET     UNIT 1    
# 4 UNIT CRU 1 1234 FAKE STREET UNIT CRU 1
# 5 UNIT 000 1234 FAKE STREET   UNIT 000  
# 6 UNIT TH1 1234 FAKE STREET   UNIT TH1  
# 7 UNIT 1 1234 FAKE HIGHWAY 1  UNIT 1    
# 8 1 1234 FAKE STREET          <NA>      
# 9 1234 FAKE STREET            <NA>      
#10 1 FAKE FAKE STREET          <NA>      
#11 FAKE STREET                 <NA>

The UNIT.*?(?=\s*\d{4,}) regex matches

UNIT - a fixed string
.*? - zero or more chars other than line break chars
(?=\s*\d{4,}) - a location immediately followed with zero or more whitespaces (\s*) and four or more digits (\d{4,}).

See the regex demo.