Home > Software engineering >  splitting strings to columns from the end of the character string
splitting strings to columns from the end of the character string

Time:12-17

I have a dataset:

dummydata = data.frame(code = c("Aardenburg G 18", "Loon op Zand H 262"))

And I want to turn it into this:

enddata = data.frame(municipality = c("Aardenburg", "Loon op Zand"),
                     section = c("G", "H"),
                     number = c("18", "262"))

Because the municipality name often has spaces in them, I have to start splitting from the end of the character string. The data always is the municipality name followed by a space, an uppercase letter, and then an integer. I looked up on the internet and stackoverflow and ran into regex lines which are totally new for me.

I managed to split the number into a different column with the following code but I cannot seem to get the uppercase letter in its own column.

dummydata2 = dummydata %>% 
  tidyr::separate("code", into = c("municipality", "number"), "(A-Z)|(?=[0-9])", extra = "merge", fill = "right", remove = F)

Anyone who can help and clarify how the regex lines work? Thanks

CodePudding user response:

A possible solution:

library(tidyverse)

dummydata = data.frame(code = c("Aardenburg G 18", "Loon op Zand H 262"))

dummydata %>% 
  separate(code, into=c("municipality", "section", "number"), 
           sep=" (?=\\d )| (?=[A-Z] \\d )", convert = T)

#>   municipality section number
#> 1   Aardenburg       G     18
#> 2 Loon op Zand       H    262

CodePudding user response:

Try this:

require(stringr)
endata <- dummydata %>%
  mutate(section = word(code, -2),
         number = word(code, -1))

changed a little your syntax, used word function from stringr instead. It selects the nth word, but by indicating a negative index, it reverses the order from the last to the first word.

CodePudding user response:

1) Replace the last space with semicolon and then do it again. Now it can be read as semicolon separated strings. Note that "(.*) " matches the longest string until and including the last space and \1; says to replace it with the part in parenthesis followed by semicolon. Backslashes need to be doubled within quotes. No packages are used.

dummydata |>
  transform(code = sub("(.*) ", "\\1;", code)) |>
  transform(code = sub("(.*) ", "\\1;", code)) |>
  with(read.table(text = code, sep = ";", 
    col.names = c("municipality", "section", "number")))

giving:

  municipality section number
1   Aardenburg       G     18
2 Loon op Zand       H    262

2) Another way to use sub is to match each field and replace the spaces between them with semicolons giving the same result. This uses one fewer sub but the regular expression is a bit more complex. \w matches one or more word characters and \d matches one or more digits. \1 refers to the first capture group, i.e. the part matched by the portion within the first parenthesis in the first argument. \2 and \3 match the second and third capture groups. No packages are used.

dummydata |>
  transform(code = sub("^(.*) (\\w ) (\\d )$", "\\1;\\2;\\3", code)) |>
  with(read.table(text = code, sep = ";", 
    col.names = c("municipality", "section", "number")))

3) We can reduce (2) to a single read.pattern statement using gsubfn package. The pattern is the same as in (2).

library(gsubfn)

read.pattern(text = dummydata$code, pattern = "^(.*) (\\w ) (\\d )$",
  col.names = c("municipality", "section", "number"))

CodePudding user response:

Sometimes regex is an overkill:

enddata <- strsplit(dummydata$code, " ") |> 
  lapply(
    \(x) data.frame(
      municipality  = paste(head(x, -2), collapse = " "), 
      section = tail(x, 2)[1], 
      number = tail(x, 1)
    )
  )
enddata <- do.call(rbind, enddata)

#   municipality section number
# 1   Aardenburg       G     18
# 2 Loon op Zand       H    262
  •  Tags:  
  • r
  • Related