I have a few large enrolment datasets and I'm trying to create two things:
- I'd like to flag each uninterrupted monthly observation (
final_df1
) - I'd like to create a dataset of uninterrupted spans (
final_df2
)
For example:
library(tidyverse)
library(lubridate)
library(magrittr)
df<-tibble(id=c(rep("X",10),rep("Y",20)),
date=c(ymd("20120101")%m %months(c(1:5,7:11)),ymd("20120401")%m %months(c(1:10,12:17,19:22))))
final_df1 <- df %>% mutate(cont_enroll=c(rep(1,5),rep(0,5),rep(1,10),rep(0,10)))
final_df2 <- tibble(id=c(rep("X",2),rep("Y",3)),
span_start=c(ymd("20120101")%m %months(1),
ymd("20120101")%m %months(7),
ymd("20120401")%m %months(1),
ymd("20120101")%m %months(12),
ymd("20120101")%m %months(19)),
span_end=c(ymd("20120101")%m %months(5),
ymd("20120101")%m %months(11),
ymd("20120101")%m %months(10),
ymd("20120101")%m %months(17),
ymd("20120101")%m %months(22))
)
I feel like there must be a simple way to do this between {lubridate} and {data.table} but I'm drawing up blanks. Any tips?
CodePudding user response:
Grouped by 'id', create an interval
with the previous value of 'date' (lag
) and the current 'date', divide by the months
, check if it is less than 2, and take the cumulative minimum (cummin
). After creating the 'find_df_new', then we group by 'id' and the run-length-id of 'cont_enroll' column, and summarise
with the first
and last
value of 'date' to create the 'span_start' and 'span_end' respectively
library(dplyr)
library(lubridate)
library(data.table)
final_df_new <- df %>%
group_by(id) %>%
mutate(cont_enroll2 = cummin(interval(lag(date, default = first(date)),
date) /months(1) < 2)) %>%
ungroup
final_df_new %>%
group_by(id, grp = rleid(cont_enroll2)) %>%
summarise(span_start = first(date), span_end = last(date), .groups = 'drop')