I am trying to separate data from one column. I am using separate
function from tidyr
package for this. But I am unable to create a regex pattern for this. The data is as follows:
structure(list(Details = c("\r\n\t\t\t\t\t\t\t\t\t\tReference No: F&A0067\r\n\t\t\t\t\t\t\t\t\t\tPages: 5\t\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 1, January, 1965\r\n\t\t\t\t\t\t\t\t\t",
"\r\n\t\t\t\t\t\t\t\t\t\tReference No: OB0233\r\n\t\t\t\t\t\t\t\t\t\tPages: 7\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 26, June, 2018\r\n\t\t\t\t\t\t\t\t\t",
"\r\n\t\t\t\t\t\t\t\t\t\tReference No: F&A0303\r\n\t\t\t\t\t\t\t\t\t\tPages: 1\t\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 1, January, 1977\r\n\t\t\t\t\t\t\t\t\t",
"\r\n\t\t\t\t\t\t\t\t\t\tReference No: BP0101\r\n\t\t\t\t\t\t\t\t\t\tPages: 7\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 1, January, 1977\r\n\t\t\t\t\t\t\t\t\t",
"\r\n\t\t\t\t\t\t\t\t\t\tReference No: P&IR0229\r\n\t\t\t\t\t\t\t\t\t\tPages: 4\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 23, June, 2015\r\n\t\t\t\t\t\t\t\t\t",
"\r\n\t\t\t\t\t\t\t\t\t\tReference No: BP0299\r\n\t\t\t\t\t\t\t\t\t\tPages: 32\r\n\t\t\t\t\t\t\t\t\t\tPublished on: 28, October, 2004\r\n\t\t\t\t\t\t\t\t\t"
)), row.names = c(NA, 6L), class = "data.frame")
I want to split the data in 5 columns: Reference, Pages, Date, Month and Year. The details I want to capture in the first row are :F&A0067, 5, 1, January and 1965 in these five columns respectively.
So, what regex pattern should I use inside tidyr::separate
function to split this data in these five columns?
CodePudding user response:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(
# remove tabs
Details = str_replace_all(Details, "[\\t]", ""),
# remove whitespace from the ends
Details = trimws(Details)
) %>%
tidyr::separate(
Details,
into = c("Reference", "Pages", "Date"),
# first split on line breaks
sep = "\\r\\n"
) %>%
mutate(
across(everything(),
str_replace,
pattern = ".*: ", # replacing everything through ': '
replacement = ""
)
) %>%
separate(
Date,
into = c("Day", "Month", "Year"),
# then split the date
sep = ", "
)
# Reference Pages Day Month Year
# 1 F&A0067 5 1 January 1965
# 2 OB0233 7 26 June 2018
# 3 F&A0303 1 1 January 1977
# 4 BP0101 7 1 January 1977
# 5 P&IR0229 4 23 June 2015
# 6 BP0299 32 28 October 2004