how to aggregate data by month using study start and end dates in R to calculate monthly disease pre-CodePudding

I have cohort study data with start and end dates for each patient. Each patient enters and exists the study at completely different dates. I would like to calculate the prevalence of a disease in each month. How can I aggregate the data in a way that shows the total number of study patients per month (n_total) and total number of patients with the disease (n_disease) per month which would allow me to calculate the prevalence (n_disease/n_total) noting that for months where there are 0 patients and 0 prevalence I still want a row to be present and to say 0 for prevalence. Also, if people leave in say for example in May, they would still be counted for the population of that month even if they were only present for some of that month.

This is how my data currently looks like:

patid	start_date	end_date	disease	disease_date
1	01/03/2016	31/08/2021	yes	15/11/2017
2	24/03/2020	31/08/2021	no	NA
3	01/03/2020	23/08/2021	yes	15/08/2020
4	24/03/2016	01/08/2019	no	NA
5	24/03/2018	17/08/2020	no	NA
6	01/03/2016	04/08/2018	yes	01/01/2017
7	01/03/2016	31/08/2018	yes	18/03/2017

Sample data:

df <- data.frame(patid=c("1","2","3","4","5","6","7"), 
                 
                 start_date=c("01/03/2016","24/08/2016", 
                              "01/01/2016","24/02/2016", 
                              "24/04/2016","01/04/2016", 
                              "01/09/2016"), 
                 
                 end_date=c("31/12/2016","31/12/2016", 
                            "23/12/2016","01/08/2016", 
                            "17/06/2016","04/05/2016", 
                            "31/10/2016"), 
                 
                 disease=c("yes","no","yes","no", 
                           "no","yes","yes"), 
                 
                 disease_date=c("15/08/2016",NA, 
                                "15/08/2016",NA,NA, 
                                "01/05/2016","31/10/2016") )

The outcome of this would look like this:

structure(list(month_year = c("01/2016", "02/2016", "03/2016", 
"04/2016", "05/2016", "06/2016", "07/2016", "08/2016", "09/2016", 
"10/2016", "11/2016", "12/2016"), n_total = c("1", "2", "3", 
"5", "5", "4", "3", "4", "4", "4", "3", "3"), n_disease = c("0", 
"0", "0", "0", "1", "0", "0", "2", "0", "1", "0", "0"), prevalence = c("0", 
"0", "0", "0", "0.2", "0", "0", "0.5", "0", "0.25", "0", "0")), class = "data.frame", row.names = c(NA, 
-12L))

CodePudding user response：

What about this:

df <- data.frame(patid=c("1","2","3","4","5","6","7","8","9","10","11","12"), 
                 
                 start_date=c("01/03/2016","24/08/2016", 
                              "01/01/2016","24/02/2016", 
                              "24/04/2016","01/04/2016", 
                              "01/09/2016","01/03/2016",
                              "24/08/2016","01/01/2016",
                              "24/02/2016","24/04/2016"), 
                 
                 end_date=c("31/12/2016","31/12/2016", 
                            "23/12/2016","01/08/2016", 
                            "17/06/2016","04/05/2016", 
                            "31/10/2016","31/12/2016", 
                            "23/12/2016","31/08/2016", 
                            "17/06/2016","04/12/2016"), 
                 
                 disease=c("yes","no","yes","no", 
                           "no","yes","yes","yes",
                           "no","yes","no", "no"), 
                 
                 disease_date=c("15/08/2016",NA, 
                                "15/08/2016",NA,NA, 
                                "01/05/2016","31/10/2016","15/11/2016",NA, 
                                "15/05/2016",NA,NA) )


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> Loading required package: timechange
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(tidyr)
df %>% 
  mutate(across(c(start_date, end_date, disease_date), dmy)) %>% 
  rowwise() %>% 
  mutate(month = list(seq.Date(start_date, end_date, by="months"))) %>%
  unnest(month) %>% 
  mutate(month = zoo::as.yearmon(month), 
         disease_01 = ifelse(zoo::as.yearmon(disease_date) == month, 1, 0)) %>% 
  group_by(month) %>% 
  summarise(n_total = length(disease_01), 
            n_disease = sum(disease_01, na.rm=TRUE))
#> # A tibble: 12 × 3
#>    month     n_total n_disease
#>    <yearmon>   <int>     <dbl>
#>  1 Jan 2016        2         0
#>  2 Feb 2016        4         0
#>  3 Mar 2016        6         0
#>  4 Apr 2016        9         0
#>  5 May 2016        9         2
#>  6 Jun 2016        6         0
#>  7 Jul 2016        6         0
#>  8 Aug 2016        7         2
#>  9 Sep 2016        7         0
#> 10 Oct 2016        7         1
#> 11 Nov 2016        6         1
#> 12 Dec 2016        4         0

^{Created on 2023-01-24 by the reprex package (v2.0.1)}

I think your expected output calculations are wrong because you clearly have two patients in January 2016 (patient IDs 3 and 10), though your expected output only has one in n_total.

CodePudding user response：

Here's a solution that uses ivs (for interval vectors), clock (for month precision dates), and vctrs (for counting matches).

Note that ivs requires half-open intervals, which in practice means that we add 1 to our "end" months before creating the interval vector.

The real stars of the show are:

vec_count_matches() to count each time a month appeared in disease_date, which gives us our n_disease
iv_count_between() to count each time a month fell between a range, which gives us our n_total

It should be very performant.

library(dplyr, warn.conflicts = FALSE)
library(clock)
library(ivs)
library(vctrs)

df <- tibble(
  patid=c("1","2","3","4","5","6","7"), 
  start_date=c("01/03/2016","24/08/2016", 
               "01/01/2016","24/02/2016", 
               "24/04/2016","01/04/2016", 
               "01/09/2016"), 
  end_date=c("31/12/2016","31/12/2016", 
             "23/12/2016","01/08/2016", 
             "17/06/2016","04/05/2016", 
             "31/10/2016"), 
  disease=c("yes","no","yes","no", 
            "no","yes","yes"), 
  disease_date=c("15/08/2016",NA, 
                 "15/08/2016",NA,NA, 
                 "01/05/2016","31/10/2016")
)

# Only need these cols
df <- df %>%
  select(start_date, end_date, disease_date)

# Turn into actual dates
df <- df %>%
  mutate(
    across(everything(), \(col) {
      date_parse(col, format = "%d/%m/%Y")
    })
  )

# We really only need month based information, so drop the days
df <- df %>%
  mutate(
    across(everything(), \(col) {
      calendar_narrow(as_year_month_day(col), "month")
    })
  )

# Turn the start/end dates into real ranges.
# Make them half-open ranges by adding 1 to the end date month
df <- df %>%
  mutate(range = iv(start_date, end_date   1L), .keep = "unused", .before = 1)

df
#> # A tibble: 7 × 2
#>                range disease_date
#>     <iv<ymd<month>>> <ymd<month>>
#> 1 [2016-03, 2017-01) 2016-08     
#> 2 [2016-08, 2017-01) NA          
#> 3 [2016-01, 2017-01) 2016-08     
#> 4 [2016-02, 2016-09) NA          
#> 5 [2016-04, 2016-07) NA          
#> 6 [2016-04, 2016-06) 2016-05     
#> 7 [2016-09, 2016-11) 2016-10

# Little helper to count the number of times each `needle` appears in `haystack`
vec_count_matches <- function(needles, haystack) {
  out <- vec_rep(0L, times = vec_size(needles))
  matches <- vec_locate_matches(needles, haystack, no_match = "drop")
  result <- vec_count(matches$needles, sort = "location")
  out[result$key] <- result$count
  out
}

# Create a full sequence from min month to max month
from <- min(iv_start(df$range))
to <- max(iv_end(df$range))

tibble(
  month = seq(from = from, to = to, by = 1),
  n_disease = vec_count_matches(month, df$disease_date),
  n_total = iv_count_between(month, df$range),
  prevalence = n_disease / n_total
)
#> # A tibble: 13 × 4
#>    month        n_disease n_total prevalence
#>    <ymd<month>>     <int>   <int>      <dbl>
#>  1 2016-01              0       1       0   
#>  2 2016-02              0       2       0   
#>  3 2016-03              0       3       0   
#>  4 2016-04              0       5       0   
#>  5 2016-05              1       5       0.2 
#>  6 2016-06              0       4       0   
#>  7 2016-07              0       3       0   
#>  8 2016-08              2       4       0.5 
#>  9 2016-09              0       4       0   
#> 10 2016-10              1       4       0.25
#> 11 2016-11              0       3       0   
#> 12 2016-12              0       3       0   
#> 13 2017-01              0       0     NaN

^{Created on 2023-01-24 with reprex v2.0.2.9000}