Split numbers and dates into separate columns-CodePudding

My data contains text strings with three important features, an id number separated by":" and a starting date and an end date. I need to get these tree numbers into three separate columns. I have tried different solutions, everything from unnest_tokens, grepl/grep, to separate, but can't seem to get it right, I may get one date, but I can't seem to get them in the correct order or into a data frame.

input data

input<- data.frame(
  id=c(1,2,3),
  value=c("a long title containing all sorts - off `characters` 2022:03 29.10.2021 
  21.02.2022",
  "but the strings always end with the same - document id, start date: and end date  2022:02 
  30.04.2020 18.02.2022",
  "so I need to split document id, start and end dates into separate columns 2000:01 
  07.10.2000 15.02.2021")
  )

desired output

output <-data.frame(
 id=c(1,2,3),
 value=c("a long title containing all sorts - off `characters`",
 "but the strings allwasys end with the same - document id, start date: and end date",
 "so i need to split document id, start and end dates into seperate collumns"),
 docid=c("2022:03", "2022:02", "2000:01"),
 start=c("29.10.2021", "30.04.2020", "07.10.2000"),
 end=c("21.02.2022", "18.02.2022", "15.02.2021")
  )

CodePudding user response：

This is most conveniently accomplished by extract: in its regex argument we exhaustively describe the strings we want to split into columns as a complex pattern in which the parts that need to go into the columns are wrapped into capture groups (...):

library(tidyr)
input %>%
  extract(value,
          into = c("value", "docid", "start", "end"),
          regex = "(.*)\\s(\\d{4}:\\d{2})\\s{1,}(.*)\\s{1,}(.*)")
  id                                                                             value   docid      start
1  1                              a long title containing all sorts - off `characters` 2022:03 29.10.2021
2  2 but the strings always end with the same - document id, start date: and end date  2022:02 30.04.2020
3  3         so I need to split document id, start and end dates into separate columns 2000:01 07.10.2000
         end
1 21.02.2022
2 18.02.2022
3 15.02.2021