I need to parse through a list of addresses and remove the ones without street numbers or with PO boxes. I want to create a column, street number, that is just the numbers at the front of the string, and NA if it starts with letters. So for:
street <- c("123 fake st", "PO box 12", "fake st unit 2", "123 fake st apt 1")
I would want:
c(123, NA, NA, 123)
I see a lot of q&a's for subsetting numbers from a string, but I'm not sure how to do it without pulling in the numbers from the back end too.
CodePudding user response:
We can use str_extract to capture one or more digits (\\d+) at the start (^) of the string:
library(stringr)
as.numeric(str_extract(street, "^\\d+"))
[1] 123 NA NA 123
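Or with base R functions: split on spaces with strsplit and convert the first piece (as.numeric warns about the NAs it introduces by coercion):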
as.numeric(sapply(strsplit(street, " "), `[`, 1))
[1] 123 NA NA 123
or trimws, with a custom whitespace pattern that strips everything from the first run of whitespace onward:
as.numeric(trimws(street, whitespace = "\\s+.*"))
[1] 123 NA NA 123
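To get from the vector to the column the question asks for, and to drop the rows without a street number, something like the following base R sketch may help. The data frame `addresses` and the column name `street_number` are assumptions made for illustration; they are not in the original post.

```r
# Hypothetical data frame; the names `addresses` and `street_number` are assumed.
addresses <- data.frame(
  street = c("123 fake st", "PO box 12", "fake st unit 2", "123 fake st apt 1"),
  stringsAsFactors = FALSE
)

# Flag the rows that start with at least one digit.
has_number <- grepl("^\\d+", addresses$street)

# regmatches() returns only the matched substrings, so pre-fill with NA
# and assign the matches to the flagged rows.
addresses$street_number <- NA_real_
addresses$street_number[has_number] <- as.numeric(
  regmatches(addresses$street, regexpr("^\\d+", addresses$street))
)

# Keep only the rows that have a street number (this also drops PO boxes).
with_number <- addresses[!is.na(addresses$street_number), ]
```

Filtering on the NA values removes the PO box rows as well, because they start with letters.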
CodePudding user response:
In base R, we can use sub to remove everything from the first non-digit character to the end of the string:
as.numeric(sub("\\D+.*", "", street))
[1] 123 NA NA 123
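The pattern-based and split-based approaches can differ when the first token is not purely numeric. A small sketch with two made-up addresses (assumed here for illustration, not taken from the question) shows the difference:

```r
# Hypothetical edge cases, not part of the original data.
tricky <- c("123A fake st", "12-14 fake st")

# sub() keeps only the leading run of digits, so "A" and "-14" are discarded.
from_sub <- as.numeric(sub("\\D+.*", "", tricky))
from_sub
# [1] 123  12

# Splitting on spaces converts the whole first token, which is not a number here.
first_token <- sapply(strsplit(tricky, " "), `[`, 1)
from_split <- suppressWarnings(as.numeric(first_token))
from_split
# [1] NA NA
```

Which behaviour is preferable depends on whether an address like "123A fake st" should count as having a street number.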
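If you do not know regular expressions, you can use the parse_number function (from readr, which the tidyverse loads) together with ifelse, as shown below. The ifelse check on the first character is needed because parse_number on its own would return 12 for "PO box 12".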
library(tidyverse)
ifelse(substr(street, 1, 1) %in% 0:9, parse_number(street), NA)
[1] 123 NA NA 123