Splitting a string in a dataframe in R using multiple literal delimiters

I have a single-column dataframe of addresses like this:

ADDRESS
123 Main Street Unit A
456 Main Street Apt 3
789 Main Street Floor 2

I would like to parse the addresses to separate the Unit/Apt/Floor information from the rest of the street address. Is there a simple way to accomplish this, knowing at the outset that the delimiters should be " Unit", " Apt", and " Floor"?

The desired end result would be a two-column dataframe that looks like this:

ADDRESS           UNIT
123 Main Street   Unit A
456 Main Street   Apt 3
789 Main Street   Floor 2

I have tried separate() from the tidyr package, but to my knowledge it accepts only a single delimiter argument. The task could be done with multiple calls to separate(), but that seems silly.

df <- df %>% tidyr::separate(ADDRESS, into = c("ADDRESS","UNIT"), sep = ' Apt')
# This would need to be repeated using ' Unit' and ' Floor'.

Similarly, it seems that stringr::str_split_fixed() should be able to handle this task, but again I cannot figure out how to complete the process with a single call (i.e., specifying the three delimiters at once).

stringr::str_split_fixed(df$Address, c(' Unit', ' Apt', ' Floor'), 2)
# Does not work! Additionally does not result in additional column in dataframe as desired.

Here is code to create the sample dataframe:

library(dplyr)    # for piping
library(tidyr)
library(stringr)

df <- data.frame(ADDRESS = c("123 Main Street Unit A", "456 Main Street Apt 3", "789 Main Street Floor 2"))

CodePudding user response：

Does this work?

Using base R:

gsub('(\\d+\\sMain Street\\s)(.*)','\\2',df$ADDRESS)
[1] "Unit A"  "Apt 3"   "Floor 2"

Using dplyr and stringr:

library(dplyr)
library(stringr)
df %>% mutate(UNIT = str_extract(ADDRESS, '(?<=Main Street ).*'))
                  ADDRESS    UNIT
1  123 Main Street Unit A  Unit A
2   456 Main Street Apt 3   Apt 3
3 789 Main Street Floor 2 Floor 2
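
Both snippets above match on the literal text "Main Street", so they only help when every address is on that street. A minimal sketch that keys on the three keywords from the question instead, assuming each keyword appears as a whole word and that the unit information always comes last (the name unit_pattern and the extra street names are made up for illustration):

```r
library(dplyr)
library(stringr)

# Sample data with different streets (illustrative, not from the question)
df <- data.frame(ADDRESS = c("123 Main Street Unit A",
                             "456 Oak Avenue Apt 3",
                             "789 Elm Road Floor 2"))

# Hypothetical name: one of the keywords as a whole word, plus everything after it
unit_pattern <- "\\b(Unit|Apt|Floor)\\b.*$"

out <- df %>%
  mutate(UNIT    = str_extract(ADDRESS, unit_pattern),           # keyword and the rest
         ADDRESS = str_trim(str_remove(ADDRESS, unit_pattern)))  # street part only
out
```

Rows without any of the keywords would get NA in UNIT and keep their ADDRESS unchanged.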

CodePudding user response：

Using tidyr::separate you could do:

library(tidyr)

df <- data.frame(ADDRESS = c("123 Main Street Unit A", "456 Main Street Apt 3", "789 Main Street Floor 2"))
df %>% 
  separate(ADDRESS, sep = "\\s(?=Unit|Apt|Floor)", into = c("address", "unit"))
#>           address    unit
#> 1 123 Main Street  Unit A
#> 2 456 Main Street   Apt 3
#> 3 789 Main Street Floor 2
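
The lookahead (?=...) is what keeps the keyword in the second column: only the whitespace in front of it is consumed as the separator. If the delimiters live in a character vector, the same pattern can be assembled with paste(). This is only a sketch, assuming the delimiters contain no regex metacharacters; the names delims, pattern, and res, and the fourth address, are illustrative.

```r
library(tidyr)

df <- data.frame(ADDRESS = c("123 Main Street Unit A",
                             "456 Main Street Apt 3",
                             "789 Main Street Floor 2",
                             "12 Main Street"))   # illustrative row with no unit

delims  <- c("Unit", "Apt", "Floor")
# Builds "\\s(?=Unit|Apt|Floor)"
pattern <- paste0("\\s(?=", paste(delims, collapse = "|"), ")")

res <- separate(df, ADDRESS, into = c("ADDRESS", "UNIT"),
                sep   = pattern,
                extra = "merge",  # split at most once; the remainder stays in UNIT
                fill  = "right")  # rows without a unit get NA in UNIT, no warning
res
```

extra = "merge" and fill = "right" guard against addresses that contain a keyword more than once or not at all.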

CodePudding user response：

This could also be helpful in base R:

df$UNIT <- trimws(regmatches(df$ADDRESS, regexpr("\\d+\\s+Main\\s+Street\\K(.*)", df$ADDRESS, perl = TRUE)))

                  ADDRESS    UNIT
1  123 Main Street Unit A  Unit A
2   456 Main Street Apt 3   Apt 3
3 789 Main Street Floor 2 Floor 2
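
This pattern is also tied to "Main Street". A base R sketch that splits at the first keyword instead, assuming the keywords appear as whole words (the variable names m, pos, len, and has_unit and the extra street names are illustrative):

```r
df <- data.frame(ADDRESS = c("123 Main Street Unit A",
                             "456 Oak Avenue Apt 3",
                             "789 Elm Road Floor 2"))

# Locate the whitespace that precedes the first keyword (-1 when there is none)
m   <- regexpr("\\s+(?=(Unit|Apt|Floor)\\b)", df$ADDRESS, perl = TRUE)
pos <- as.integer(m)              # start of the separator
len <- attr(m, "match.length")    # width of the separator

has_unit <- pos > 0
# Everything after the separator is the unit, everything before it is the street
df$UNIT    <- ifelse(has_unit, substring(df$ADDRESS, pos + len), NA_character_)
df$ADDRESS <- ifelse(has_unit, substring(df$ADDRESS, 1, pos - 1), df$ADDRESS)
df
```

Unlike the regmatches() call above, which silently drops non-matching elements and would then fail to assign to the column, this leaves UNIT as NA for addresses without a keyword.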