I am trying to see how I can create a variable which summarizes observations across multiple dates.
library(data.table)
library(lubridate)
library(magrittr)
sample <- data.table(start = c("2018-12-22 23:00:00",
"2018-12-23 06:00:00",
"2018-12-22 06:00:00",
"2018-12-23 06:00:00"),
end = c("2018-12-23 06:00:00",
"2018-12-23 13:00:00",
"2018-12-23 12:00:00",
"2018-12-24 01:00:00"),
store = c("A", "A", "B", "B"))
sample[, start:= ymd_hms(start)]
sample[, end := ymd_hms(end)]
sample
> sample
start end store
1: 2018-12-22 23:00:00 2018-12-23 06:00:00 A
2: 2018-12-23 06:00:00 2018-12-23 13:00:00 A
3: 2018-12-22 06:00:00 2018-12-23 12:00:00 B
4: 2018-12-23 06:00:00 2018-12-24 01:00:00 B
Here, sample is a time card of "shifts" worked at each store. Store A has two observations, each with a start and end time. If no shift "bled" across dates (as the first observation does, starting on 2018-12-22 and ending on 2018-12-23), I would simply subtract the start time from the end time and sum by store and date to get the total number of minutes worked at each store. Something like:
worked_mins <- sample %>%
.[, date := ymd(substr(start,1,10))] %>%
.[, minutes := end - start] %>%
.[, .(worked_mins = sum(minutes)), by = .(store,date)]
However, I am trying to see how best to sum the time worked when a shift spans more than one date (potentially two or more days).
From the above, the desired output would be (the values are in hours, despite the column name):
worked_mins = data.table(store = c("A","A", "B", "B", "B"),
date = c("2018-12-22", "2018-12-23",
"2018-12-22", "2018-12-23",
"2018-12-24"),
worked_mins = c(1, 13, 18, 30, 1))
> worked_mins
store date worked_mins
1: A 2018-12-22 1
2: A 2018-12-23 13
3: B 2018-12-22 18
4: B 2018-12-23 30
5: B 2018-12-24 1
Thanks!
CodePudding user response:
An updated solution that measures the actual elapsed time rather than counting whole hours, so it should take fractional hours into account.
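library(data.table)
library(lubridate) # ceiling_date, floor_date
# Split one shift (st to en) at every midnight it crosses; one row per date touched.
func <- function(st, en, units = "hours") {
  # the midnight at or after each daily step from st to en
  midns <- ceiling_date(seq(st, en, by = "day"), unit = "day")
  # breakpoints: the shift's endpoints plus any midnight strictly between them
  times <- unique(sort(c(midns[ st < midns & midns < en], st, en)))
  if (length(times) < 2) {
    # zero-length shift (st == en): a single row with a duration of 0
    data.table(date = as.Date(floor_date(st)), d = structure(0, class = "difftime", units = units))
  } else {
    # each interval is attributed to the date on which it starts
    data.table(date = as.Date(floor_date(times[-length(times)], unit = "days")), d = `units<-`(diff(times), units))
  }
}
sample[, rbindlist(Map(func, start, end)), by = .(store)
  ][, .(d = sum(d)), by = .(store, date)]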
# store date d
# <char> <Date> <difftime>
# 1: A 2018-12-22 1 hours
# 2: A 2018-12-23 13 hours
# 3: B 2018-12-22 18 hours
# 4: B 2018-12-23 30 hours
# 5: B 2018-12-24 1 hours
(The 1 hours is still a numeric column, it just has a label of its units attached; this can be removed easily by wrapping the diff in as.numeric.)
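Since the question asks for minutes, here is a minimal sketch of that conversion in base R. The vector d below is a hypothetical stand-in for the d column in the output above, not something taken from the original code; as.numeric on a difftime accepts a units argument, so the same values can be read back in hours or in minutes.

```r
# Hypothetical stand-in for the `d` column: per-date totals stored as hours
d <- as.difftime(c(1, 13, 18, 30, 1), units = "hours")

# Drop the units label and keep the plain numbers (still hours)
hours <- as.numeric(d)

# Ask for the same durations expressed in minutes instead
mins <- as.numeric(d, units = "mins")

print(hours)
print(mins)
```

If you want minutes straight out of func, passing units = "mins" to it should have the same effect, since it sets the units on the diff before returning.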
func works by inserting the midnights that fall between st and en; building times, an ordered vector of these unique timestamps, lets us diff across them and then floor_date each one so that we know the date on which each diff started.
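To make those steps concrete, here is a rough walk-through for a single shift (the third row of the question's data, store B). This is a base-R restatement of the same idea rather than the func code itself: it swaps ceiling_date/floor_date for as.Date, assumes UTC timestamps, and the names days and pieces are only for illustration.

```r
# One shift from the question: store B, 06:00 on the 22nd to 12:00 on the 23rd (UTC assumed)
st <- as.POSIXct("2018-12-22 06:00:00", tz = "UTC")
en <- as.POSIXct("2018-12-23 12:00:00", tz = "UTC")

# Midnight at the start of every calendar day from st's date through en's date
days  <- seq(as.Date(st, tz = "UTC"), as.Date(en, tz = "UTC"), by = "day")
midns <- as.POSIXct(format(days), tz = "UTC")

# Keep only the midnights that fall strictly inside the shift
midns <- midns[st < midns & midns < en]

# Ordered breakpoints: start, any interior midnights, end
times <- sort(unique(c(st, midns, en)))

# Each gap between consecutive breakpoints belongs to the date on which it starts
pieces <- data.frame(
  date  = as.Date(times[-length(times)], tz = "UTC"),
  hours = as.numeric(diff(times), units = "hours")
)
print(pieces)
```

For this shift the single interior midnight splits the 30 hours into one piece that starts on 2018-12-22 and one that starts on 2018-12-23; summing such pieces by store and date is what the final data.table step does.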
You can see what func is doing with this quick demo, which turns the first row into a 0-second shift (for testing and validation):
copy(sample)[1, end:=start][, rbindlist(Map(func, start, end)), by = .(store)]
# store date d
# <char> <Date> <difftime>
# 1: A 2018-12-22 0 hours
# 2: A 2018-12-23 7 hours
# 3: B 2018-12-22 18 hours
# 4: B 2018-12-23 12 hours
# 5: B 2018-12-23 18 hours
# 6: B 2018-12-24 1 hours
CodePudding user response:
Does this achieve what you need?
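library(dplyr)     # rowwise, mutate, select, group_by, count
library(tidyr)     # unnest
library(purrr)     # map2
library(lubridate) # floor_date
sample %>%
  rowwise() %>%
  mutate(
    # one timestamp per whole hour of the shift (the end time itself is dropped)
    worked_hours = map2(start, end, ~seq(.x, .y, "hours") %>% head(-1))
  ) %>%
  unnest(cols = c(worked_hours)) %>%
  select(store, worked_hours) %>%
  mutate(date = floor_date(worked_hours, "days")) %>%
  group_by(store, date) %>%
  count(name = "worked_mins")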
# A tibble: 5 x 3
# Groups: store, date [5]
store date worked_mins
<chr> <dttm> <int>
1 A 2018-12-22 00:00:00 1
2 A 2018-12-23 00:00:00 13
3 B 2018-12-22 00:00:00 18
4 B 2018-12-23 00:00:00 30
5 B 2018-12-24 00:00:00 1