Let's say I have school enrollment data stored in this format, with start date and end date fields:
unique_name | enrollment_start | enrollment_end |
---|---|---|
Amy | 1, Jan, 2017 | 30, Sep 2018 |
Franklin | 1, Jan, 2017 | 19, Feb, 2017 |
Franklin | 5, Jun, 2017 | 4, Feb, 2018 |
Franklin | 21, Oct, 2018 | 9, Mar, 2019 |
Samir | 1, Jun, 2017 | 4, Feb, 2017 |
Samir | 5, Apr, 2017 | 12, Sep, 2018 |
... | ... | ... |
And I want to produce aggregated counts of enrollment by month like this:
month | enrollment_count |
---|---|
Jan, 2017 | 25 |
Feb, 2017 | 31 |
Mar, 2017 | 19 |
Apr, 2017 | 34 |
May, 2017 | 29 |
Jun, 2017 | 32 |
... | ... |
Is there an easy way to accomplish this with dplyr?
The only way I can think to do this is by looping over a list of all months from range month_min to month_max to count the number of rows with start or stop dates that fall inside each month. Hoping for easier code.
CodePudding user response:
Create a list column containing the sequence of months between each set of dates, then unnest and count.
Notes:
- I use
lubridate::floor_date()
to roundenrollment_start
to the first day of the month. Otherwise,seq()
may skip months ifenrollment_start
is on the 29th of the month or later. - The fifth row of your example data has
enrollment_start
later thanenrollment_end
-- I assumed this was an error and removed.
library(tidyverse)
library(lubridate)
enrollments %>%
mutate(
across(c(enrollment_start, enrollment_end), dmy), # convert to date
month = map2(
floor_date(enrollment_start, unit = "month"), # round to 1st day
enrollment_end,
~ seq(.x, .y, by = "month")
)
) %>%
unnest_longer(month) %>%
count(month, name = "enrollment_count")
#> # A tibble: 27 x 2
#> month enrollment_count
#> <date> <int>
#> 1 2017-01-01 2
#> 2 2017-02-01 2
#> 3 2017-03-01 1
#> 4 2017-04-01 2
#> 5 2017-05-01 2
#> 6 2017-06-01 3
#> 7 2017-07-01 3
#> 8 2017-08-01 3
#> 9 2017-09-01 3
#> 10 2017-10-01 3
#> # ... with 17 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)
CodePudding user response:
Here's my take on this with dplyr
and tidyr
.
- Pivot the data creating multiple rows per student and format your dates.
- group on student and generate missing months using
complete
. - group on the generated periods and count.
data %>%
pivot_longer(cols=c('enrollment_start','enrollment_end')) %>%
mutate(value = as.Date(value, format = "%d, %B, %Y")) %>%
mutate(value = lubridate::floor_date(value, 'month')) %>%
# unique_name name value
# <chr> <chr> <date>
# 1 Amy enrollment_start 2017-01-01
# 2 Amy enrollment_end 2018-09-30
# 3 Franklin enrollment_start 2017-01-01
# 4 Franklin enrollment_end 2017-02-19
# ..etc.
group_by(unique_name) %>%
complete(value = seq.Date(min(value), max(value), by="month")) %>%
arrange(unique_name, value)
enrollment_count <- group_by(data, value) %>%
count()
Edit: I forgot to floor the dates in order to properly aggregate per period at the end. Added floor_date
from lubridate
to do this.