I have a data frame where each row is a record with its id, start and end dates. I would like to create another data frame that contains every calendar month's start dates (eg "2020-01-01" is January), and a second column counting how many unique records are open (for any/all portion of) that month.
I could create new columns for each calendar month and generate dummies for whether a record is open that month, then add up each column. What's a more efficient way of doing this?
ds <- data.frame(record_id = c("00a", "00b", "00c"),
record_start_date = as.Date(c("2020-01-16", "2020-03-25", "2020-02-22")),
record_end_date = as.Date(c("2020-12-05", "2020-06-21", "2020-11-12")))
CodePudding user response:
Here's an approach where we reshape the data and add rows for each month start. Then it can be a very efficient vectorized cumulative count to figure out the active records as of the end of the 1st of each month. If you want to count a record that ends on the 1st (or one that ends the same day it began) toward the count, you could add a line to shift end dates one day later.
library(tidyverse); library(lubridate)
ds %>%
pivot_longer(-record_id) %>%
mutate(change = if_else(name == "record_start_date", 1, -1)) %>%
# mutate(value = value if_else(name == "record_end_date", 1, 0)) %>%
add_row(name = "month_start",
value = seq.Date(floor_date(min(ds$record_start_date), "month"),
floor_date(max(ds$record_end_date), "month"),
by = "month"),
change = 0) %>%
arrange(value, desc(name)) %>%
mutate(count = cumsum(change)) %>%
filter(name == "month_start") %>%
select(value, count)
Result:
# A tibble: 12 × 2
value count
<date> <dbl>
1 2020-01-01 0
2 2020-02-01 1
3 2020-03-01 2
4 2020-04-01 3
5 2020-05-01 3
6 2020-06-01 3
7 2020-07-01 2
8 2020-08-01 2
9 2020-09-01 2
10 2020-10-01 2
11 2020-11-01 2
12 2020-12-01 1