Updating Dates and Date Intervals in R

Not even sure if I've described the problem accurately in the title, but here goes.

Suppose I have the following data.table/data.frame:

library(data.table)
library(lubridate)


DT <- data.table(begin = c("2019-06-01 09:00:00","2019-06-01 09:00:00", "2019-06-01 09:00:00",
                           "2019-06-01 09:00:00", "2016-06-01 09:00:00","2016-06-01 09:00:00"),
                 end = c("2019-06-03 14:00:00", "2019-06-03 14:00:00", "2019-06-03 14:00:00",
                         "2019-06-02 05:00:00", "2019-06-02 05:00:00", "2016-06-01 23:15:00"),
                 person = c("A", "A","A", "B", "B", "C"))

                 begin                 end person
1: 2019-06-01 09:00:00 2019-06-03 14:00:00      A
2: 2019-06-01 09:00:00 2019-06-03 14:00:00      A
3: 2019-06-01 09:00:00 2019-06-03 14:00:00      A
4: 2019-06-01 09:00:00 2019-06-02 05:00:00      B
5: 2016-06-01 09:00:00 2019-06-02 05:00:00      B
6: 2016-06-01 09:00:00 2016-06-01 23:15:00      C

This dataset records the time stamps at which a period (a "shift") began and ended for each person. Each shift's row is repeated once for every calendar date the shift spans. For example, person A has three entries for the same shift because it spans three distinct dates: 06-01, 06-02, and 06-03. Some shifts begin and end within the same day, so they appear only once.

What I want is to update the begin and end dates of the above dataset, so that I can see what time each shift began and ended at the day level. So the dataset should look like:

                 begin                 end person
1: 2019-06-01 09:00:00 2019-06-02 00:00:00      A
2: 2019-06-02 00:00:00 2019-06-03 00:00:00      A
3: 2019-06-03 00:00:00 2019-06-03 14:00:00      A
4: 2019-06-01 09:00:00 2019-06-02 00:00:00      B
5: 2016-06-02 00:00:00 2019-06-02 05:00:00      B
6: 2016-06-01 09:00:00 2016-06-01 23:15:00      C

Any help would be greatly appreciated!

CodePudding user response:

First, convert the date columns to POSIXct. (I have also changed row 5's begin from 2016 to 2019, since a shift starting in 2016 and running through to 2019 seems unlikely.)

DT[, c("begin", "end"):=lapply(.SD, as.POSIXct), .SDcols=c("begin", "end")]

## we get this
DT <- as.data.table(structure(list(begin = structure(c(1559394000, 1559394000, 1559394000, 1559394000, 1559394000, 1464786000), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1559584800, 1559584800, 1559584800, 1559466000, 1559466000, 1464837300), class = c("POSIXct", "POSIXt"), tzone = ""), person = c("A", "A", "A", "B", "B", "C")), row.names = c(NA, -6L), class = c("data.table", "data.frame")))

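Second, we define a function that splits one shift at every midnight falling between its start and end:

func <- function(st, en) {
  # one candidate per daily step from st, rounded up to the following midnight
  midns <- lubridate::ceiling_date(seq(st, en, by = "day"), unit = "day")
  # keep only midnights strictly inside the shift, then add both endpoints
  times <- unique(sort(c(midns[ st < midns & midns < en], st, en)))
  # consecutive break points become (begin, end) pairs
  data.table(begin = times[-length(times)], end = times[-1])
}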
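
To see what the function produces for a single shift, here is a minimal sketch that traces person A's shift through it. The explicit tz = "UTC" is my assumption (the answer relies on the session's local time zone), and `pieces` is just an illustrative name.

library(data.table)
library(lubridate)

# same function as above, repeated so this snippet runs on its own
func <- function(st, en) {
  midns <- lubridate::ceiling_date(seq(st, en, by = "day"), unit = "day")
  times <- unique(sort(c(midns[st < midns & midns < en], st, en)))
  data.table(begin = times[-length(times)], end = times[-1])
}

# person A's shift; the UTC time zone is an assumption for reproducibility
st <- as.POSIXct("2019-06-01 09:00:00", tz = "UTC")
en <- as.POSIXct("2019-06-03 14:00:00", tz = "UTC")

# the daily steps are 06-01, 06-02 and 06-03 at 09:00; rounding them up gives
# the midnights starting 06-02, 06-03 and 06-04, and only the first two fall
# strictly inside the shift
pieces <- func(st, en)
print(pieces)

The result has three rows, one per calendar date the shift touches, with the cuts at the midnights starting 06-02 and 06-03.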

Lastly, we apply it, using by=.(person) to preserve that column in the output. I use unique(DT) rather than DT, since we do not need (or even want) the duplicated rows for each shift/day:

unique(DT)[, rbindlist(Map(func, begin, end)), by = .(person)]
#    person               begin                 end
#    <char>              <POSc>              <POSc>
# 1:      A 2019-06-01 09:00:00 2019-06-02 00:00:00
# 2:      A 2019-06-02 00:00:00 2019-06-03 00:00:00
# 3:      A 2019-06-03 00:00:00 2019-06-03 14:00:00
# 4:      B 2019-06-01 09:00:00 2019-06-02 00:00:00
# 5:      B 2019-06-02 00:00:00 2019-06-02 05:00:00
# 6:      C 2016-06-01 09:00:00 2016-06-01 23:15:00
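
If you would rather keep the repeated rows and update them in place, one possible alternative is to number the rows within each shift and clamp each one to its own calendar day. This is only a sketch under a few assumptions: row 5 is corrected to 2019 as above, the time zone is fixed to UTC, each shift really is repeated once per date it spans, and the helper columns `k` and `day0` are names I made up.

library(data.table)
library(lubridate)

DT <- data.table(
  begin = as.POSIXct(c("2019-06-01 09:00:00", "2019-06-01 09:00:00", "2019-06-01 09:00:00",
                       "2019-06-01 09:00:00", "2019-06-01 09:00:00", "2016-06-01 09:00:00"),
                     tz = "UTC"),
  end = as.POSIXct(c("2019-06-03 14:00:00", "2019-06-03 14:00:00", "2019-06-03 14:00:00",
                     "2019-06-02 05:00:00", "2019-06-02 05:00:00", "2016-06-01 23:15:00"),
                   tz = "UTC"),
  person = c("A", "A", "A", "B", "B", "C")
)

# k = 0, 1, 2, ... is the position of each repeated row within its shift
DT[, k := seq_len(.N) - 1L, by = .(person, begin, end)]
# midnight at the start of the shift's first day
DT[, day0 := floor_date(begin, unit = "day")]

# row k covers calendar day k of the shift, clipped to the shift's own bounds
res <- DT[, .(person,
              begin = pmax(begin, day0 + days(k)),
              end   = pmin(end, day0 + days(k + 1L)))]
print(res)

This keeps the original row count and row order, which may be handy if other columns need to stay attached to each row.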

CodePudding user response:

Assuming you had a typo for row 5 person B (begin 2019 not 2016):

DT <- data.table(begin = c("2019-06-01 09:00:00","2019-06-01 09:00:00", "2019-06-01 09:00:00",
                           "2019-06-01 09:00:00", "2019-06-01 09:00:00","2016-06-01 09:00:00"),
                 end = c("2019-06-03 14:00:00", "2019-06-03 14:00:00", "2019-06-03 14:00:00",
                         "2019-06-02 05:00:00", "2019-06-02 05:00:00", "2016-06-01 23:15:00"),
                 person = c("A", "A","A", "B", "B", "C"))

DT[, min_day:=as.numeric(difftime(end,begin, units="mins"))/ceiling(as.numeric(difftime(end,begin, units="days"))), person]
unique(DT)
                 begin                 end person min_day
1: 2019-06-01 09:00:00 2019-06-03 14:00:00      A    1060
2: 2019-06-01 09:00:00 2019-06-02 05:00:00      B    1200
3: 2016-06-01 09:00:00 2016-06-01 23:15:00      C     855
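
Note that min_day divides the total shift minutes by the shift's duration in days rounded up, which is not always the number of calendar dates the shift touches. A small base-R check for person B's shift illustrates the difference; the UTC time zone and the variable names are my own assumptions.

# person B's shift: 20 hours long, but it touches two calendar dates
begin <- as.POSIXct("2019-06-01 09:00:00", tz = "UTC")
end   <- as.POSIXct("2019-06-02 05:00:00", tz = "UTC")

# total length of the shift in minutes
total_mins <- as.numeric(difftime(end, begin, units = "mins"))

# divisor used for min_day: duration in days, rounded up
dur_days <- ceiling(as.numeric(difftime(end, begin, units = "days")))

# number of distinct calendar dates the shift touches
n_dates <- as.integer(as.Date(end, tz = "UTC") - as.Date(begin, tz = "UTC")) + 1L

print(c(total_mins = total_mins, dur_days = dur_days, n_dates = n_dates))

Here dur_days is 1 while n_dates is 2, so min_day reports 1200 minutes for B even though the shift is split across two dates.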