How to extract postal code from a string of text into a new column, in R?

I have a data frame of more than 10,000 rows. Column c contains the full address as a string, including the postal code. I would like to extract the 6-digit postal code into a new column. Every postal code comes after the word "Singapore".

An example is as follows:

df <- data.frame(a, b, c)

c <- c("YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180", "MOMO CLINIC 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902",...)

# need to extract 6-digit postal codes in c, into a new column, d

How do I extract the 6 digit postal codes into a new column, d?

Thank you!

CodePudding user response：

Use str_extract:

library(dplyr)
library(stringr)  
df %>%
    mutate(d = str_extract(c, "\\d{6}"))
   a  b                                                                c      d
1 NA NA   YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180 248180
2 NA NA MOMO CLINIC 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902 150902

The pattern \\d{6} simply matches the first run of six digits in each string. If six-digit strings that are not postal codes can occur in your data, refine the pattern using the context around the codes. For example, the postal codes appear to always sit at the end of the string. That end-of-string position can be targeted with the anchor $, like so: \\d{6}$
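To illustrate the difference, here is a minimal sketch comparing the plain pattern with two context-aware ones. The second address is hypothetical: a made-up six-digit phone number was added in front of the postal code. The lookbehind variant relies on the question's statement that codes follow the word "Singapore" and assumes exactly one space in between.

```r
library(stringr)

# Hypothetical addresses; the second has an extra six-digit number
# (a made-up phone number) ahead of the postal code.
addr <- c(
  "YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180",
  "MOMO CLINIC TEL 612345, 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902"
)

first_six  <- str_extract(addr, "\\d{6}")                 # first six-digit run anywhere
at_end     <- str_extract(addr, "\\d{6}$")                # six digits at the very end
after_word <- str_extract(addr, "(?<=Singapore )\\d{6}")  # six digits right after "Singapore "

first_six   # "248180" "612345"
at_end      # "248180" "150902"
after_word  # "248180" "150902"
```

Only the anchored and lookbehind patterns skip the phone number. Which one fits better depends on whether the code is reliably last in the string or reliably preceded by "Singapore ".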

Data:

  df <- data.frame(
    a = NA,
    b = NA,
    c = c("YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180", "MOMO CLINIC 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902")
  )

CodePudding user response：

Answer:

dummy <- c("YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180", "MOMO CLINIC 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902")
regmatches(dummy, regexpr("(\\d{6})", dummy))
[1] "248180" "150902"
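One caveat with regmatches() and regexpr(): elements without a match are dropped from the result, so the output can be shorter than the input and cannot be assigned directly as a data frame column. Below is a minimal base R sketch that keeps one value per row. The middle address is a hypothetical one without a postal code, added only to show the effect.

```r
# Hypothetical input: the second element has no six-digit code
addr <- c(
  "YVL WELLNESS CLINIC 510 CAMDEN STREET #01-01, Singapore 248180",
  "ADDRESS WITHOUT A POSTAL CODE",
  "MOMO CLINIC 512 CHOA CHU KANG STREET, #10-1102, Singapore 150902"
)

m <- regexpr("\\d{6}", addr)             # match position, -1 where nothing matched
matched <- regmatches(addr, m)           # length 2: the non-matching element is dropped

d <- rep(NA_character_, length(addr))    # start with NA for every row
d[m > 0] <- matched                      # fill in only the rows that matched

d  # "248180" NA "150902"
```

With the result padded this way, something like df$d <- d lines up with the rows of the original data frame.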

CodePudding user response：

If the postal code is always at the end of the string, as in your example, there are two more alternatives using the stringr package. Both extract only the last word in the string:

library(stringr)
word(c,-1)

str_extract(c, '\\w+$')

[1] "248180" "150902"