How would I extract the date from a string that contains random information?

I have strings in the following format:

NHS Workforce%20Statistics,%20April 2018%20Organisation -%20Excel%20tables.xlsx

NHS Workforce%20Statistics,%20September 2018%20Organisation.xlsx

How would I extract the date from each one? In the first example the date would be April 2018, and in the second it would be September 2018. (Note that the year will not always be 2018.)

So far I have tried creating a vector of months and using str_match to see whether any of the strings contain one of the months in the vector. I was then planning to use a regex to find the 6 digit value that contains the date and select the last four of those digits. This approach feels quite long, and I suspect there is a quicker solution using the tidyverse.

CodePudding user response：

Notice that the spaces in the file names are being replaced with %20.

Something like the following will work (you'll just have to add the rest of the months to the regex).

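library(stringr)

# File names as given in the question, with "%20" in place of some spaces
example <- "NHS Workforce%20Statistics,%20April 2018%20Organisation -%20Excel%20tables.xlsx"
example2 <- "NHS Workforce%20Statistics,%20September 2018%20Organisation.xlsx"

# Restore the spaces, then extract "<Month> <year>"
file_name <- str_replace_all(example, "%20", " ")
str_extract(file_name, "(April|September) \\d{4}")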

For the first example you get:

[1] "April 2018"
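To avoid typing out every month by hand, the alternation can be built from base R's month.name constant, which holds the twelve English month names. The following is a minimal sketch, assuming the stringr package is installed; the names files, month_pattern and dates are illustrative and not part of the original answer.

```r
library(stringr)

# Sample file names from the question, with "%20" standing in for some spaces
files <- c(
  "NHS Workforce%20Statistics,%20April 2018%20Organisation -%20Excel%20tables.xlsx",
  "NHS Workforce%20Statistics,%20September 2018%20Organisation.xlsx"
)

# month.name is a base R constant holding the twelve English month names
month_pattern <- paste0("(", paste(month.name, collapse = "|"), ") \\d{4}")

# Turn "%20" back into spaces, then pull out "<Month> <year>"
dates <- str_extract(str_replace_all(files, fixed("%20"), " "), month_pattern)
dates
```

Because str_extract is vectorised, the same call should work on a whole column of file names, returning NA for any name without a month and year.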

CodePudding user response：

We can create the list of month names pretty swiftly using format(..., "%B") (this gives the names in the current locale), then all that's left is to put it into a pattern that captures the month and the year:

pattern <- paste0("^.*(", paste(format(ISOdate(2020,1:12,1),"%B"), collapse = "|"), ") (\\d+).*$")
gsub(pattern, "\\1 \\2", your_text)

CodePudding user response：

Extract the first word that is immediately followed by a four-digit number, along with that number.

Base R option -

vec <- c("NHS Workforce Statistics, April 2018 Organisation - Excel tables.xlsx",
         "NHS Workforce Statistics, September 2018 Organisation.xlsx")

return_date <- function(x) {
  # Lazily skip the prefix, then capture a word directly followed by a 4-digit year
  sub('.*?([A-Za-z]+) (\\d{4}).*', '\\1 \\2', x, perl = TRUE)
}

return_date(vec)
#[1] "April 2018"     "September 2018"
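
One thing to keep in mind with the sub() approach is that sub() returns its input unchanged when the pattern does not match, so a file name without a date comes back whole. A possible variant, sketched here with a hypothetical helper name extract_date (not from the original answer), uses regexpr() and regmatches() to return NA in that case.

```r
# Hypothetical helper: returns NA instead of the untouched input when no
# "<word> <4-digit year>" pair is present
extract_date <- function(x) {
  out <- rep(NA_character_, length(x))
  m <- regexpr("[A-Za-z]+ \\d{4}", x)   # position of the first match, -1 if none
  out[m != -1] <- regmatches(x, m)      # regmatches() returns matched elements only
  out
}

result <- extract_date(c(
  "NHS Workforce Statistics, April 2018 Organisation.xlsx",
  "no date here.xlsx"
))
result
```

Returning NA makes unmatched file names easy to spot with is.na() before any further date handling.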