I have a dataframe with ID and treatment dates like this as below
ID Dates
1 01/2/2012
1 02/8/2012
1 03/8/2012
1 04/5/2013
1 05/5/2013
2 01/2/2012
2 03/5/2013
2 04/6/2013
I need to find for each ID, if there is a treatment date break for more than a year. If yes, then I need to break them into two courses, and list the start & end date. So after executing R codes, it will look like below:
ID Course1StarteDate Course1EndDate Break1to2(Yr) Course2StartDate Course2EndDate
1 01/2/2012 03/8/2012 1.075 04/5/2013 05/5/2013
2 01/2/2012 01/2/2012 1.173 03/5/2013 04/6/2013
The dataframe I have includes hundreds of IDs, and I don't know how many courses there will be. Is there an efficient way of using R to solve this? Thanks in advance!
CodePudding user response:
If d
is your data, you can identify when the difference between a row's date and the prior row's date exceeds 365 (or perhaps 365.25), and, then use cumsum
to generate the distinct treatment courses. Finally add a column that estimates the duration of the "break" between courses.
as_tibble(d) %>%
group_by(ID) %>%
mutate(trt=as.numeric(Dates-lag(Dates)),
trt=cumsum(if_else(is.na(trt),0,trt)>365) 1) %>%
group_by(ID,trt) %>%
summarize(StartDate = min(Dates),
EndDate = max(Dates),.groups = "drop_last") %>%
mutate(Break:=as.numeric(lead(StartDate) - EndDate)/365)
Output:
ID trt StartDate EndDate Break
<dbl> <dbl> <date> <date> <dbl>
1 1 1 2012-01-02 2012-03-08 1.08
2 1 2 2013-04-05 2013-05-05 NA
3 2 1 2012-01-02 2012-01-02 1.17
4 2 2 2013-03-05 2013-04-06 NA
I would suggest keeping in this long format, rather than swinging to wide format as you have in your example, especially with hundreds of IDs, all with potentially different numbers of courses. The long format is almost always better.
However, if you really want this, you can continue the pipeline from above, like this:
ungroup %>%
pivot_wider(id_cols =ID,
names_from = trt,
values_from = c(StartDate:Break),
names_glue = "Course{trt}_{.value}",
names_vary = "slowest")
to produce this "wide" format:
ID Course1_StartDate Course1_EndDate Course1_Break Course2_StartDate Course2_EndDate Course2_Break
<dbl> <date> <date> <dbl> <date> <date> <dbl>
1 1 2012-01-02 2012-03-08 1.08 2013-04-05 2013-05-05 NA
2 2 2012-01-02 2012-01-02 1.17 2013-03-05 2013-04-06 NA