Let's say I have a very tall data frame of all the months each doctor is enrolled:
doctor | month_enrolled |
---|---|
Dr. Brown | March, 2021 |
Dr. Brown | April, 2021 |
Dr. Brown | July, 2021 |
Dr. Brown | August, 2021 |
Dr. Green | April, 2021 |
Dr. Green | May, 2021 |
Dr. Black | Jan, 2021 |
Dr. Black | May, 2021 |
Dr. Black | June, 2021 |
Dr. Black | July, 2021 |
Dr. Pink | September, 2021 |
... | ... |
And I want to consolidate this into a smaller data frame with ranges of enrollment_start and enrollment_end per doctor like this:
doctor | enrollment_start | enrollment_end |
---|---|---|
Dr. Brown | March, 2021 | April, 2021 |
Dr. Brown | July, 2021 | August, 2021 |
Dr. Green | April, 2021 | May, 2021 |
Dr. Black | Jan, 2021 | Jan, 2021 |
DR. Black | May, 2021 | July, 2021 |
Dr. Pink | September, 2021 | September, 2021 |
... | ... | ... |
Whereby non-continuous enrollment periods are reflected as separate rows.
How might I accomplish this in R, preferably using dplyr / tidyverse syntax?
CodePudding user response:
library(dplyr)
library(lubridate)
data.frame(
stringsAsFactors = FALSE,
doctor = c("Dr. Brown","Dr. Brown",
"Dr. Brown","Dr. Brown","Dr. Green","Dr. Green",
"Dr. Black","Dr. Black","Dr. Black","Dr. Black","Dr. Pink"),
month_enrolled = c("March, 2021","April, 2021",
"July, 2021","August, 2021","April, 2021","May, 2021",
"Jan, 2021","May, 2021","June, 2021","July, 2021",
"September, 2021")
) %>%
transmute(doctor, month = dmy(paste(1,month_enrolled))) %>%
group_by(doctor) %>%
mutate(stint = 1 cumsum(c(0, diff(month) > 31))) %>% # EDIT - shorter
group_by(doctor, stint) %>%
summarize(enrollment_start = min(month),
enrollment_end = max(month), .groups = "drop")
Result
# A tibble: 6 × 4
doctor stint enrollment_start enrollment_end
<chr> <dbl> <date> <date>
1 Dr. Black 1 2021-01-01 2021-01-01
2 Dr. Black 2 2021-05-01 2021-07-01
3 Dr. Brown 1 2021-03-01 2021-04-01
4 Dr. Brown 2 2021-07-01 2021-08-01
5 Dr. Green 1 2021-04-01 2021-05-01
6 Dr. Pink 1 2021-09-01 2021-09-01