Use dplyr to aggregate counts by month from start-stop ranged variables?-CodePudding

Let's say I have school enrollment data stored in this format, with start date and end date fields:

unique_name	enrollment_start	enrollment_end
Amy	1, Jan, 2017	30, Sep 2018
Franklin	1, Jan, 2017	19, Feb, 2017
Franklin	5, Jun, 2017	4, Feb, 2018
Franklin	21, Oct, 2018	9, Mar, 2019
Samir	1, Jun, 2017	4, Feb, 2017
Samir	5, Apr, 2017	12, Sep, 2018
...	...	...

And I want to produce aggregated counts of enrollment by month like this:

month	enrollment_count
Jan, 2017	25
Feb, 2017	31
Mar, 2017	19
Apr, 2017	34
May, 2017	29
Jun, 2017	32
...	...

Is there an easy way to accomplish this with dplyr?

The only way I can think to do this is by looping over a list of all months from range month_min to month_max to count the number of rows with start or stop dates that fall inside each month. Hoping for easier code.

CodePudding user response：

Create a list column containing the sequence of months between each set of dates, then unnest and count.

Notes:

I use lubridate::floor_date() to round enrollment_start to the first day of the month. Otherwise, seq() may skip months if enrollment_start is on the 29th of the month or later.
The fifth row of your example data has enrollment_start later than enrollment_end -- I assumed this was an error and removed.

library(tidyverse)
library(lubridate)

enrollments %>% 
  mutate(
    across(c(enrollment_start, enrollment_end), dmy),  # convert to date
    month = map2(
      floor_date(enrollment_start, unit = "month"),    # round to 1st day
      enrollment_end,
      ~ seq(.x, .y, by = "month")
    )
  ) %>% 
  unnest_longer(month) %>% 
  count(month, name = "enrollment_count")

#> # A tibble: 27 x 2
#>    month      enrollment_count
#>    <date>                <int>
#>  1 2017-01-01                2
#>  2 2017-02-01                2
#>  3 2017-03-01                1
#>  4 2017-04-01                2
#>  5 2017-05-01                2
#>  6 2017-06-01                3
#>  7 2017-07-01                3
#>  8 2017-08-01                3
#>  9 2017-09-01                3
#> 10 2017-10-01                3
#> # ... with 17 more rows

^{Created on 2022-03-25 by the reprex package (v2.0.1)}

CodePudding user response：

Here's my take on this with dplyr and tidyr.

Pivot the data creating multiple rows per student and format your dates.
group on student and generate missing months using complete.
group on the generated periods and count.

data %>%
  pivot_longer(cols=c('enrollment_start','enrollment_end')) %>%
    mutate(value = as.Date(value, format =  "%d, %B, %Y")) %>%
    mutate(value = lubridate::floor_date(value, 'month')) %>%
  
#   unique_name name             value     
#   <chr>       <chr>            <date>    
# 1 Amy         enrollment_start 2017-01-01
# 2 Amy         enrollment_end   2018-09-30
# 3 Franklin    enrollment_start 2017-01-01
# 4 Franklin    enrollment_end   2017-02-19
#   ..etc.

  group_by(unique_name) %>%
  complete(value = seq.Date(min(value), max(value), by="month")) %>%
  arrange(unique_name, value) 

enrollment_count <- group_by(data, value) %>%
  count()

Edit: I forgot to floor the dates in order to properly aggregate per period at the end. Added floor_date from lubridate to do this.