I have an interesting bit of data that is a function of the customer's data entry process. Each time there is an update the data entry team simply appends the date and relevant comments in the same Excel cell. Thus it looks like this...
entry <- "9/10/2021 received request to order more beer. 9/15/2021 Beer arrived in old truck 10/09/2021 Sent notice to driver."
There are really only two things the team requires going forward and those are extracting the first date, and the last date with the relevant text.
It needs to be in a data frame like this.
First date | Last date | note
----------- ------------- ----------------------
9/10/2021 | 10/09/2021 | Sent notice to driver
Thank you.
CodePudding user response:
Doubling entry
to show this works on vectors of strings:
entry <- rep(entry, 2)
Base R solution:
gre <- gregexpr("[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}.", entry)
# fix the "match.length" to extend until the next match or EOS
gre2 <- Map(function(G, txt) `attr<-`(G, "match.length", c(G[-1] - 1L, nchar(txt))), gre, entry)
do.call(rbind, lapply(regmatches(entry, gre2), function(txt) {
dat <- strcapture("([0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4})\\s?(.*)", txt, list(date="", text=""))
data.frame(first=dat$date[1], last=dat$date[nrow(dat)], note=dat$text[nrow(dat)])
}))
# first last note
# 1 9/10/2021 10/09/2021 Sent notice to driver.
# 2 9/10/2021 10/09/2021 Sent notice to driver.
CodePudding user response:
You could try this:
library(stringr)
library(dplyr)
dates <- str_extract_all(entry, "\\d{1,2}/\\d{2}/\\d{4}")
text <- strsplit(entry, split = "(?<=\\d) ", perl=TRUE)
`First date` <- dates[[1]][1]
`Last date` <- dates[[1]][3]
note <- text[[1]][4]
df <- tibble(
`First date`,
`Last date`,
note
)
df
# A tibble: 1 × 3
`First date` `Last date` note
<chr> <chr> <chr>
1 9/10/2021 10/09/2021 Sent notice to driver.