Home > Mobile >  Flagging continuous observations and creating enrolment spans
Flagging continuous observations and creating enrolment spans

Time:05-14

I have a few large enrolment datasets and I'm trying to create two things:

  1. I'd like to flag each uninterrupted monthly observation (final_df1)
  2. I'd like to create a dataset of uninterrupted spans (final_df2)

For example:

library(tidyverse)
library(lubridate)
library(magrittr)
df<-tibble(id=c(rep("X",10),rep("Y",20)),
           date=c(ymd("20120101")%m %months(c(1:5,7:11)),ymd("20120401")%m %months(c(1:10,12:17,19:22))))

final_df1 <- df %>% mutate(cont_enroll=c(rep(1,5),rep(0,5),rep(1,10),rep(0,10)))

final_df2 <- tibble(id=c(rep("X",2),rep("Y",3)),
                  span_start=c(ymd("20120101")%m %months(1),
                               ymd("20120101")%m %months(7),
                               ymd("20120401")%m %months(1),
                               ymd("20120101")%m %months(12),
                               ymd("20120101")%m %months(19)),
                  span_end=c(ymd("20120101")%m %months(5),
                             ymd("20120101")%m %months(11),
                             ymd("20120101")%m %months(10),
                             ymd("20120101")%m %months(17),
                             ymd("20120101")%m %months(22))
                  )

I feel like there must be a simple way to do this between {lubridate} and {data.table} but I'm drawing up blanks. Any tips?

CodePudding user response:

Grouped by 'id', create an interval with the previous value of 'date' (lag) and the current 'date', divide by the months, check if it is less than 2, and take the cumulative minimum (cummin). After creating the 'find_df_new', then we group by 'id' and the run-length-id of 'cont_enroll' column, and summarise with the first and last value of 'date' to create the 'span_start' and 'span_end' respectively

library(dplyr)
library(lubridate)
library(data.table)
final_df_new <- df %>%
   group_by(id)  %>%
   mutate(cont_enroll2 = cummin(interval(lag(date, default = first(date)), 
       date) /months(1) < 2))  %>%
   ungroup
final_df_new %>% 
  group_by(id, grp = rleid(cont_enroll2))  %>%
  summarise(span_start = first(date), span_end = last(date), .groups = 'drop')
  • Related