Home > OS >  Find treatment course breaks longer than a year based on dates
Find treatment course breaks longer than a year based on dates

Time:03-29

I have a dataframe with ID and treatment dates like this as below

ID  Dates
1   01/2/2012
1   02/8/2012
1   03/8/2012
1   04/5/2013
1   05/5/2013
2   01/2/2012
2   03/5/2013
2   04/6/2013

I need to find for each ID, if there is a treatment date break for more than a year. If yes, then I need to break them into two courses, and list the start & end date. So after executing R codes, it will look like below:

ID  Course1StarteDate  Course1EndDate Break1to2(Yr) Course2StartDate  Course2EndDate 
1   01/2/2012          03/8/2012      1.075         04/5/2013         05/5/2013
2   01/2/2012          01/2/2012      1.173         03/5/2013         04/6/2013

The dataframe I have includes hundreds of IDs, and I don't know how many courses there will be. Is there an efficient way of using R to solve this? Thanks in advance!

CodePudding user response:

If d is your data, you can identify when the difference between a row's date and the prior row's date exceeds 365 (or perhaps 365.25), and, then use cumsum to generate the distinct treatment courses. Finally add a column that estimates the duration of the "break" between courses.

as_tibble(d) %>% 
  group_by(ID) %>% 
  mutate(trt=as.numeric(Dates-lag(Dates)),
         trt=cumsum(if_else(is.na(trt),0,trt)>365) 1) %>% 
  group_by(ID,trt) %>% 
  summarize(StartDate = min(Dates),
            EndDate = max(Dates),.groups = "drop_last") %>% 
  mutate(Break:=as.numeric(lead(StartDate) - EndDate)/365)

Output:

     ID   trt StartDate  EndDate    Break
  <dbl> <dbl> <date>     <date>     <dbl>
1     1     1 2012-01-02 2012-03-08  1.08
2     1     2 2013-04-05 2013-05-05 NA   
3     2     1 2012-01-02 2012-01-02  1.17
4     2     2 2013-03-05 2013-04-06 NA   

I would suggest keeping in this long format, rather than swinging to wide format as you have in your example, especially with hundreds of IDs, all with potentially different numbers of courses. The long format is almost always better.

However, if you really want this, you can continue the pipeline from above, like this:

  ungroup %>%
  pivot_wider(id_cols =ID,
              names_from = trt,
              values_from = c(StartDate:Break),
              names_glue = "Course{trt}_{.value}",
              names_vary = "slowest")

to produce this "wide" format:

     ID Course1_StartDate Course1_EndDate Course1_Break Course2_StartDate Course2_EndDate Course2_Break
  <dbl> <date>            <date>                  <dbl> <date>            <date>                  <dbl>
1     1 2012-01-02        2012-03-08               1.08 2013-04-05        2013-05-05                 NA
2     2 2012-01-02        2012-01-02               1.17 2013-03-05        2013-04-06                 NA
  • Related