Home > database >  Finding how many records are open in a month, in a dataset containing single records with start and
Finding how many records are open in a month, in a dataset containing single records with start and

Time:09-01

I have a data frame where each row is a record with its id, start and end dates. I would like to create another data frame that contains every calendar month's start dates (eg "2020-01-01" is January), and a second column counting how many unique records are open (for any/all portion of) that month.

I could create new columns for each calendar month and generate dummies for whether a record is open that month, then add up each column. What's a more efficient way of doing this?

ds <- data.frame(record_id = c("00a", "00b", "00c"),
                 record_start_date = as.Date(c("2020-01-16", "2020-03-25", "2020-02-22")),
                 record_end_date = as.Date(c("2020-12-05", "2020-06-21", "2020-11-12")))

CodePudding user response:

Here's an approach where we reshape the data and add rows for each month start. Then it can be a very efficient vectorized cumulative count to figure out the active records as of the end of the 1st of each month. If you want to count a record that ends on the 1st (or one that ends the same day it began) toward the count, you could add a line to shift end dates one day later.

library(tidyverse); library(lubridate)
ds %>%
  pivot_longer(-record_id) %>%
  mutate(change = if_else(name == "record_start_date", 1, -1)) %>%
  # mutate(value = value   if_else(name == "record_end_date", 1, 0)) %>%
  add_row(name = "month_start", 
          value = seq.Date(floor_date(min(ds$record_start_date), "month"),
                           floor_date(max(ds$record_end_date), "month"),
                           by = "month"),
          change = 0) %>%
  arrange(value, desc(name)) %>%
  mutate(count = cumsum(change)) %>%
  filter(name == "month_start") %>% 
  select(value, count)

Result:

# A tibble: 12 × 2
   value      count
   <date>     <dbl>
 1 2020-01-01     0
 2 2020-02-01     1
 3 2020-03-01     2
 4 2020-04-01     3
 5 2020-05-01     3
 6 2020-06-01     3
 7 2020-07-01     2
 8 2020-08-01     2
 9 2020-09-01     2
10 2020-10-01     2
11 2020-11-01     2
12 2020-12-01     1
  • Related