Home > Software engineering >  Use dplyr to aggregate counts by month from start-stop ranged variables?
Use dplyr to aggregate counts by month from start-stop ranged variables?

Time:03-26

Let's say I have school enrollment data stored in this format, with start date and end date fields:

unique_name enrollment_start enrollment_end
Amy 1, Jan, 2017 30, Sep 2018
Franklin 1, Jan, 2017 19, Feb, 2017
Franklin 5, Jun, 2017 4, Feb, 2018
Franklin 21, Oct, 2018 9, Mar, 2019
Samir 1, Jun, 2017 4, Feb, 2017
Samir 5, Apr, 2017 12, Sep, 2018
... ... ...

And I want to produce aggregated counts of enrollment by month like this:

month enrollment_count
Jan, 2017 25
Feb, 2017 31
Mar, 2017 19
Apr, 2017 34
May, 2017 29
Jun, 2017 32
... ...

Is there an easy way to accomplish this with dplyr?

The only way I can think to do this is by looping over a list of all months from range month_min to month_max to count the number of rows with start or stop dates that fall inside each month. Hoping for easier code.

CodePudding user response:

Create a list column containing the sequence of months between each set of dates, then unnest and count.

Notes:

  1. I use lubridate::floor_date() to round enrollment_start to the first day of the month. Otherwise, seq() may skip months if enrollment_start is on the 29th of the month or later.
  2. The fifth row of your example data has enrollment_start later than enrollment_end -- I assumed this was an error and removed.
library(tidyverse)
library(lubridate)

enrollments %>% 
  mutate(
    across(c(enrollment_start, enrollment_end), dmy),  # convert to date
    month = map2(
      floor_date(enrollment_start, unit = "month"),    # round to 1st day
      enrollment_end,
      ~ seq(.x, .y, by = "month")
    )
  ) %>% 
  unnest_longer(month) %>% 
  count(month, name = "enrollment_count")

#> # A tibble: 27 x 2
#>    month      enrollment_count
#>    <date>                <int>
#>  1 2017-01-01                2
#>  2 2017-02-01                2
#>  3 2017-03-01                1
#>  4 2017-04-01                2
#>  5 2017-05-01                2
#>  6 2017-06-01                3
#>  7 2017-07-01                3
#>  8 2017-08-01                3
#>  9 2017-09-01                3
#> 10 2017-10-01                3
#> # ... with 17 more rows

Created on 2022-03-25 by the reprex package (v2.0.1)

CodePudding user response:

Here's my take on this with dplyr and tidyr.

  1. Pivot the data creating multiple rows per student and format your dates.
  2. group on student and generate missing months using complete.
  3. group on the generated periods and count.
data %>%
  pivot_longer(cols=c('enrollment_start','enrollment_end')) %>%
    mutate(value = as.Date(value, format =  "%d, %B, %Y")) %>%
    mutate(value = lubridate::floor_date(value, 'month')) %>%
  
#   unique_name name             value     
#   <chr>       <chr>            <date>    
# 1 Amy         enrollment_start 2017-01-01
# 2 Amy         enrollment_end   2018-09-30
# 3 Franklin    enrollment_start 2017-01-01
# 4 Franklin    enrollment_end   2017-02-19
#   ..etc.

  group_by(unique_name) %>%
  complete(value = seq.Date(min(value), max(value), by="month")) %>%
  arrange(unique_name, value) 

enrollment_count <- group_by(data, value) %>%
  count()

Edit: I forgot to floor the dates in order to properly aggregate per period at the end. Added floor_date from lubridate to do this.

  • Related