Split String at First Occurrence of an Integer using R

Time:02-08

Note: I have already read "Split string at first occurrence of an integer in a string"; however, my request is different because I would like to use R.

Suppose I have the following example data frame:

> df = data.frame(name_and_address =
      c("Mr. Smith12 Some street",
        "Mr. Jones345 Another street",
        "Mr. Anderson6 A different street"))
> df
                  name_and_address
1          Mr. Smith12 Some street
2      Mr. Jones345 Another street
3 Mr. Anderson6 A different street

I would like to split the string at the first occurrence of an integer. Notice that the integers are of varying length.

The desired output can be like the following:

[[1]]
[1] "Mr. Smith"
[2] "12 Some street",

[[2]]
[1] "Mr. Jones"
[2] "345 Another street",

[[3]]
[1] "Mr. Anderson"
[2] "6 A different street"

I have tried the following, but I cannot get the regular expression right:

# Attempt 1 (Does not work)
library(data.table)
tstrsplit(df,'(?=\\d+)', perl=TRUE, type.convert=TRUE)

# Attempt 2 (Does not work)
library(stringr)
str_split(fha_ltc, "\\d+")

CodePudding user response:

You can use tidyr::extract:

library(tidyr)
df <- df %>% 
    extract("name_and_address", c("name", "address"), "(\\D*)(\\d.*)")
## => df
##           name              address
## 1    Mr. Smith       12 Some street
## 2    Mr. Jones   345 Another street
## 3 Mr. Anderson 6 A different street

The (\D*)(\d.*) regex matches the following:

  • (\D*) - Group 1: zero or more non-digit characters
  • (\d.*) - Group 2: a digit followed by zero or more of any characters, as many as possible.
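
As a rough illustration of how the two groups divide each string, the same pattern can be tried in base R with regexec and regmatches. This is only a sketch and not part of the tidyr solution; the names x, m, and parts are made up for the example.

```r
# Example input, copied from the question
x <- c("Mr. Smith12 Some street",
       "Mr. Jones345 Another street",
       "Mr. Anderson6 A different street")

# regexec() reports the full match plus each capture group;
# regmatches() extracts the corresponding substrings from x
m <- regmatches(x, regexec("(\\D*)(\\d.*)", x, perl = TRUE))

# Drop the full match (element 1) and keep Group 1 and Group 2
parts <- lapply(m, function(p) p[-1])

parts[[1]]
# [1] "Mr. Smith"      "12 Some street"
```

Group 1 stops right before the first digit because \D cannot match a digit, which is why numbers of any length end up entirely in Group 2.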

Another solution with stringr::str_split is also possible:

str_split(df$name_and_address, "(?=\\d)", n=2)
## => [[1]]
## [1] "Mr. Smith"      "12 Some street"

## [[2]]
## [1] "Mr. Jones"          "345 Another street"

## [[3]]
## [1] "Mr. Anderson"         "6 A different street"

The (?=\d) positive lookahead finds a location before a digit, and n=2 tells stringr::str_split to only split into 2 chunks max.
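
The same idea, cutting each string just before its first digit, can also be sketched in base R without a lookahead by locating the digit with regexpr and slicing with substr. This is an alternative illustration rather than what str_split does internally; x, pos, name, and address are names invented for the example, and strings that contain no digit are not handled specially.

```r
# Example input, copied from the question
x <- c("Mr. Smith12 Some street",
       "Mr. Jones345 Another street",
       "Mr. Anderson6 A different street")

# Position of the first digit in each string (-1 if there is none)
pos <- as.integer(regexpr("\\d", x))

# Everything before the first digit, and everything from it onward
name    <- substr(x, 1, pos - 1)
address <- substr(x, pos, nchar(x))

name
# [1] "Mr. Smith"    "Mr. Jones"    "Mr. Anderson"
address
# [1] "12 Some street"       "345 Another street"   "6 A different street"
```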

CodePudding user response:

I would use sub here:

df$name <- sub("(\\D+).*", "\\1", df$name_and_address)
df$address <- sub(".*?(\\d+.*)", "\\1", df$name_and_address)
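
A quick check of this sub approach on the example data, written with explicit + quantifiers, might look like the sketch below. The perl = TRUE argument is an addition made here so that the lazy .*? follows PCRE semantics, and the no_digit input is a made-up edge case showing that sub returns a string unchanged when the pattern does not match.

```r
# Example data frame, copied from the question
df <- data.frame(name_and_address =
        c("Mr. Smith12 Some street",
          "Mr. Jones345 Another street",
          "Mr. Anderson6 A different street"))

# Keep only the leading run of non-digits
df$name <- sub("(\\D+).*", "\\1", df$name_and_address, perl = TRUE)

# Lazily skip up to the first digit, then keep everything from there
df$address <- sub(".*?(\\d+.*)", "\\1", df$name_and_address, perl = TRUE)

# Hypothetical edge case: no digit, so the second pattern cannot match
no_digit <- sub(".*?(\\d+.*)", "\\1", "Mr. Brown", perl = TRUE)
# no_digit is "Mr. Brown" (the input comes back unchanged)
```

Because of that edge case, rows without a digit would end up with the full string in both columns, so it may be worth checking for a digit first with grepl("\\d", ...).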