Extract text after the last date in R-CodePudding

I have an interesting bit of data that is a function of the customer's data entry process. Each time there is an update the data entry team simply appends the date and relevant comments in the same Excel cell. Thus it looks like this...

entry <- "9/10/2021 received request to order more beer. 9/15/2021 Beer arrived in old truck 10/09/2021 Sent notice to driver."

There are really only two things the team requires going forward and those are extracting the first date, and the last date with the relevant text.

It needs to be in a data frame like this.

First date | Last date   | note
----------- ------------- ----------------------
9/10/2021  | 10/09/2021  | Sent notice to driver

Thank you.

CodePudding user response：

Doubling entry to show this works on vectors of strings:

entry <- rep(entry, 2)

Base R solution:

gre <- gregexpr("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}.", entry)
# fix the "match.length" to extend until the next match or EOS
gre2 <- Map(function(G, txt) `attr<-`(G, "match.length", c(G[-1] - 1L, nchar(txt))), gre, entry)

do.call(rbind, lapply(regmatches(entry, gre2), function(txt) {
  dat <- strcapture("([0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4})\\s?(.*)", txt, list(date="", text=""))
  data.frame(first=dat$date[1],  last=dat$date[nrow(dat)], note=dat$text[nrow(dat)])
}))
#       first       last                   note
# 1 9/10/2021 10/09/2021 Sent notice to driver.
# 2 9/10/2021 10/09/2021 Sent notice to driver.

CodePudding user response：

You could try this:

library(stringr)
library(dplyr)

dates <- str_extract_all(entry, "\\d{1,2}/\\d{2}/\\d{4}")
text <- strsplit(entry, split = "(?<=\\d) ", perl=TRUE)

`First date` <- dates[[1]][1]
`Last date` <- dates[[1]][3]
note <- text[[1]][4]

df <- tibble(
  `First date`,
  `Last date`,
  note
)

df

# A tibble: 1 × 3
  `First date` `Last date` note                  
  <chr>        <chr>       <chr>                 
1 9/10/2021    10/09/2021  Sent notice to driver.