Note: I have already read "Split string at first occurrence of an integer in a string"; however, my question is different because I would like to use R.
Suppose I have the following example data frame:
> df = data.frame(name_and_address =
c("Mr. Smith12 Some street",
"Mr. Jones345 Another street",
"Mr. Anderson6 A different street"))
> df
name_and_address
1 Mr. Smith12 Some street
2 Mr. Jones345 Another street
3 Mr. Anderson6 A different street
I would like to split the string at the first occurrence of an integer. Notice that the integers are of varying length.
The desired output can be like the following:
[[1]]
[1] "Mr. Smith"
[2] "12 Some street",
[[2]]
[1] "Mr. Jones"
[2] "345 Another street",
[[3]]
[1] "Mr. Anderson"
[2] "6 A different street"
I have tried the following, but I cannot get the regular expression right:
# Attempt 1 (Does not work)
library(data.table)
tstrsplit(df,'(?=\\d+)', perl=TRUE, type.convert=TRUE)
# Attempt 2 (Does not work)
library(stringr)
str_split(fha_ltc, "\\d+")
CodePudding user response:
You can use tidyr::extract:
library(tidyr)
df <- df %>%
extract("name_and_address", c("name", "address"), "(\\D*)(\\d.*)")
## => df
## name address
## 1 Mr. Smith 12 Some street
## 2 Mr. Jones 345 Another street
## 3 Mr. Anderson 6 A different street
The (\D*)(\d.*) regex matches the following:
(\D*) - Group 1: any zero or more non-digit chars
(\d.*) - Group 2: a digit and then any zero or more chars, as many as possible.
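To see what each capture group picks up, here is a minimal base R sketch that applies the same pattern with regexec and regmatches. It is only an illustration, not part of the tidyr solution; the variable names x, m, and parts are made up for this example, and perl = TRUE is my own choice.

```r
# Sample strings copied from the question
x <- c("Mr. Smith12 Some street",
       "Mr. Jones345 Another street",
       "Mr. Anderson6 A different street")

# regexec() reports the full match plus one entry per capture group
m <- regmatches(x, regexec("(\\D*)(\\d.*)", x, perl = TRUE))

# Keep elements 2 and 3 (group 1 and group 2) of every match
parts <- do.call(rbind, lapply(m, function(p) p[2:3]))
colnames(parts) <- c("name", "address")
print(parts)
```

Each element of m has three entries: the whole match first, then Group 1 and Group 2.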
Another solution with stringr::str_split is also possible:
str_split(df$name_and_address, "(?=\\d)", n=2)
## => [[1]]
## [1] "Mr. Smith" "12 Some street"
## [[2]]
## [1] "Mr. Jones" "345 Another street"
## [[3]]
## [1] "Mr. Anderson" "6 A different street"
The (?=\d) positive lookahead finds a location before a digit, and n=2 tells stringr::str_split to split into at most 2 chunks.
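The same "cut at the position just before the first digit" idea can also be sketched in base R without a lookahead, by locating the first digit with regexpr and slicing with substr. This is an alternative illustration rather than what str_split does internally; the names x, pos, name, and address are assumptions for the example.

```r
x <- c("Mr. Smith12 Some street",
       "Mr. Jones345 Another street",
       "Mr. Anderson6 A different street")

# Position of the first digit in each string (-1 when there is none)
pos <- regexpr("\\d", x)

# Everything before the first digit, then everything from it onward
name    <- substr(x, 1, pos - 1)
address <- substr(x, pos, nchar(x))

print(data.frame(name, address))
```

Strings that contain no digit at all give pos == -1 and would need separate handling.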
CodePudding user response:
I would use sub here:
df$name <- sub("(\\D+).*", "\\1", df$name_and_address)
df$address <- sub(".*?(\\d+.*)", "\\1", df$name_and_address)
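If the list-of-pairs shape from the question is wanted, the two sub calls can be wrapped in a small helper. This is a rough sketch: split_at_first_digit is a hypothetical name, the patterns are written with the + quantifier, and perl = TRUE is my own addition so the lazy .*? is handled by PCRE.

```r
# Hypothetical helper: returns one c(name, address) vector per input string
split_at_first_digit <- function(x) {
  # Leading run of non-digits
  name <- sub("(\\D+).*", "\\1", x, perl = TRUE)
  # From the first digit to the end of the string
  address <- sub(".*?(\\d+.*)", "\\1", x, perl = TRUE)
  mapply(c, name, address, SIMPLIFY = FALSE, USE.NAMES = FALSE)
}

res <- split_at_first_digit(c("Mr. Smith12 Some street",
                              "Mr. Jones345 Another street",
                              "Mr. Anderson6 A different street"))
print(res)
```

Note that sub returns its input unchanged when the pattern does not match, so a string without any digit would appear in both positions.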