Home > Software design >  What regex pattern should I use in R to capture strings after a pattern?
What regex pattern should I use in R to capture strings after a pattern?

Time:11-13

I am trying to separate data from one column. I am using separate function from tidyr package for this. But I am unable to create a regex pattern for this. The data is as follows:

structure(list(Details = c("\r\n\t\t\t\t\t\t\t\t\t\tReference No: F&A0067\r\n\t\t\t\t\t\t\t\t\t\tPages: 5\t\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 1, January, 1965\r\n\t\t\t\t\t\t\t\t\t", 
"\r\n\t\t\t\t\t\t\t\t\t\tReference No: OB0233\r\n\t\t\t\t\t\t\t\t\t\tPages: 7\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 26, June, 2018\r\n\t\t\t\t\t\t\t\t\t", 
"\r\n\t\t\t\t\t\t\t\t\t\tReference No: F&A0303\r\n\t\t\t\t\t\t\t\t\t\tPages: 1\t\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 1, January, 1977\r\n\t\t\t\t\t\t\t\t\t", 
"\r\n\t\t\t\t\t\t\t\t\t\tReference No: BP0101\r\n\t\t\t\t\t\t\t\t\t\tPages: 7\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 1, January, 1977\r\n\t\t\t\t\t\t\t\t\t", 
"\r\n\t\t\t\t\t\t\t\t\t\tReference No: P&IR0229\r\n\t\t\t\t\t\t\t\t\t\tPages: 4\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 23, June, 2015\r\n\t\t\t\t\t\t\t\t\t", 
"\r\n\t\t\t\t\t\t\t\t\t\tReference No: BP0299\r\n\t\t\t\t\t\t\t\t\t\tPages: 32\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 28, October, 2004\r\n\t\t\t\t\t\t\t\t\t"
)), row.names = c(NA, 6L), class = "data.frame")

I want to split the data in 5 columns: Reference, Pages, Date, Month and Year. The details I want to capture in the first row are :F&A0067, 5, 1, January and 1965 in these five columns respectively.

So, what regex pattern should I use inside tidyr::separate function to split this data in these five columns?

CodePudding user response:

library(dplyr)
library(tidyr)
library(stringr)

df %>%
  mutate(
    # remove tabs
    Details = str_replace_all(Details, "[\\t]", ""),
    # remove whitespace from the ends
    Details = trimws(Details)
  ) %>%
  tidyr::separate(
    Details, 
    into = c("Reference", "Pages", "Date"),
    # first split on line breaks
    sep = "\\r\\n"
  ) %>%
  mutate(
    across(everything(),
      str_replace,
      pattern = ".*: ", # replacing everything through ': '
      replacement = ""
    )
  ) %>%
  separate(
    Date, 
    into = c("Day", "Month", "Year"),
    # then split the date
    sep = ", "
  )
#  Reference Pages Day   Month Year
# 1   F&A0067     5   1 January 1965
# 2    OB0233     7  26    June 2018
# 3   F&A0303     1   1 January 1977
# 4    BP0101     7   1 January 1977
# 5  P&IR0229     4  23    June 2015
# 6    BP0299    32  28 October 2004
  • Related