Home > Back-end >  how to aggregate data by month using study start and end dates in R to calculate monthly disease pre
how to aggregate data by month using study start and end dates in R to calculate monthly disease pre

Time:01-25

I have cohort study data with start and end dates for each patient. Each patient enters and exists the study at completely different dates. I would like to calculate the prevalence of a disease in each month. How can I aggregate the data in a way that shows the total number of study patients per month (n_total) and total number of patients with the disease (n_disease) per month which would allow me to calculate the prevalence (n_disease/n_total) noting that for months where there are 0 patients and 0 prevalence I still want a row to be present and to say 0 for prevalence. Also, if people leave in say for example in May, they would still be counted for the population of that month even if they were only present for some of that month.

This is how my data currently looks like:

patid start_date end_date disease disease_date
1 01/03/2016 31/08/2021 yes 15/11/2017
2 24/03/2020 31/08/2021 no NA
3 01/03/2020 23/08/2021 yes 15/08/2020
4 24/03/2016 01/08/2019 no NA
5 24/03/2018 17/08/2020 no NA
6 01/03/2016 04/08/2018 yes 01/01/2017
7 01/03/2016 31/08/2018 yes 18/03/2017

Sample data:

df <- data.frame(patid=c("1","2","3","4","5","6","7"), 
                 
                 start_date=c("01/03/2016","24/08/2016", 
                              "01/01/2016","24/02/2016", 
                              "24/04/2016","01/04/2016", 
                              "01/09/2016"), 
                 
                 end_date=c("31/12/2016","31/12/2016", 
                            "23/12/2016","01/08/2016", 
                            "17/06/2016","04/05/2016", 
                            "31/10/2016"), 
                 
                 disease=c("yes","no","yes","no", 
                           "no","yes","yes"), 
                 
                 disease_date=c("15/08/2016",NA, 
                                "15/08/2016",NA,NA, 
                                "01/05/2016","31/10/2016") )



The outcome of this would look like this:

structure(list(month_year = c("01/2016", "02/2016", "03/2016", 
"04/2016", "05/2016", "06/2016", "07/2016", "08/2016", "09/2016", 
"10/2016", "11/2016", "12/2016"), n_total = c("1", "2", "3", 
"5", "5", "4", "3", "4", "4", "4", "3", "3"), n_disease = c("0", 
"0", "0", "0", "1", "0", "0", "2", "0", "1", "0", "0"), prevalence = c("0", 
"0", "0", "0", "0.2", "0", "0", "0.5", "0", "0.25", "0", "0")), class = "data.frame", row.names = c(NA, 
-12L))

CodePudding user response:

What about this:

df <- data.frame(patid=c("1","2","3","4","5","6","7","8","9","10","11","12"), 
                 
                 start_date=c("01/03/2016","24/08/2016", 
                              "01/01/2016","24/02/2016", 
                              "24/04/2016","01/04/2016", 
                              "01/09/2016","01/03/2016",
                              "24/08/2016","01/01/2016",
                              "24/02/2016","24/04/2016"), 
                 
                 end_date=c("31/12/2016","31/12/2016", 
                            "23/12/2016","01/08/2016", 
                            "17/06/2016","04/05/2016", 
                            "31/10/2016","31/12/2016", 
                            "23/12/2016","31/08/2016", 
                            "17/06/2016","04/12/2016"), 
                 
                 disease=c("yes","no","yes","no", 
                           "no","yes","yes","yes",
                           "no","yes","no", "no"), 
                 
                 disease_date=c("15/08/2016",NA, 
                                "15/08/2016",NA,NA, 
                                "01/05/2016","31/10/2016","15/11/2016",NA, 
                                "15/05/2016",NA,NA) )


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> Loading required package: timechange
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(tidyr)
df %>% 
  mutate(across(c(start_date, end_date, disease_date), dmy)) %>% 
  rowwise() %>% 
  mutate(month = list(seq.Date(start_date, end_date, by="months"))) %>%
  unnest(month) %>% 
  mutate(month = zoo::as.yearmon(month), 
         disease_01 = ifelse(zoo::as.yearmon(disease_date) == month, 1, 0)) %>% 
  group_by(month) %>% 
  summarise(n_total = length(disease_01), 
            n_disease = sum(disease_01, na.rm=TRUE))
#> # A tibble: 12 × 3
#>    month     n_total n_disease
#>    <yearmon>   <int>     <dbl>
#>  1 Jan 2016        2         0
#>  2 Feb 2016        4         0
#>  3 Mar 2016        6         0
#>  4 Apr 2016        9         0
#>  5 May 2016        9         2
#>  6 Jun 2016        6         0
#>  7 Jul 2016        6         0
#>  8 Aug 2016        7         2
#>  9 Sep 2016        7         0
#> 10 Oct 2016        7         1
#> 11 Nov 2016        6         1
#> 12 Dec 2016        4         0

Created on 2023-01-24 by the reprex package (v2.0.1)

I think your expected output calculations are wrong because you clearly have two patients in January 2016 (patient IDs 3 and 10), though your expected output only has one in n_total.

CodePudding user response:

Here's a solution that uses ivs (for interval vectors), clock (for month precision dates), and vctrs (for counting matches).

Note that ivs requires half-open intervals, which in practice means that we add 1 to our "end" months before creating the interval vector.

The real stars of the show are:

  • vec_count_matches() to count each time a month appeared in disease_date, which gives us our n_disease
  • iv_count_between() to count each time a month fell between a range, which gives us our n_total

It should be very performant.

library(dplyr, warn.conflicts = FALSE)
library(clock)
library(ivs)
library(vctrs)

df <- tibble(
  patid=c("1","2","3","4","5","6","7"), 
  start_date=c("01/03/2016","24/08/2016", 
               "01/01/2016","24/02/2016", 
               "24/04/2016","01/04/2016", 
               "01/09/2016"), 
  end_date=c("31/12/2016","31/12/2016", 
             "23/12/2016","01/08/2016", 
             "17/06/2016","04/05/2016", 
             "31/10/2016"), 
  disease=c("yes","no","yes","no", 
            "no","yes","yes"), 
  disease_date=c("15/08/2016",NA, 
                 "15/08/2016",NA,NA, 
                 "01/05/2016","31/10/2016")
)

# Only need these cols
df <- df %>%
  select(start_date, end_date, disease_date)

# Turn into actual dates
df <- df %>%
  mutate(
    across(everything(), \(col) {
      date_parse(col, format = "%d/%m/%Y")
    })
  )

# We really only need month based information, so drop the days
df <- df %>%
  mutate(
    across(everything(), \(col) {
      calendar_narrow(as_year_month_day(col), "month")
    })
  )

# Turn the start/end dates into real ranges.
# Make them half-open ranges by adding 1 to the end date month
df <- df %>%
  mutate(range = iv(start_date, end_date   1L), .keep = "unused", .before = 1)

df
#> # A tibble: 7 × 2
#>                range disease_date
#>     <iv<ymd<month>>> <ymd<month>>
#> 1 [2016-03, 2017-01) 2016-08     
#> 2 [2016-08, 2017-01) NA          
#> 3 [2016-01, 2017-01) 2016-08     
#> 4 [2016-02, 2016-09) NA          
#> 5 [2016-04, 2016-07) NA          
#> 6 [2016-04, 2016-06) 2016-05     
#> 7 [2016-09, 2016-11) 2016-10

# Little helper to count the number of times each `needle` appears in `haystack`
vec_count_matches <- function(needles, haystack) {
  out <- vec_rep(0L, times = vec_size(needles))
  matches <- vec_locate_matches(needles, haystack, no_match = "drop")
  result <- vec_count(matches$needles, sort = "location")
  out[result$key] <- result$count
  out
}

# Create a full sequence from min month to max month
from <- min(iv_start(df$range))
to <- max(iv_end(df$range))

tibble(
  month = seq(from = from, to = to, by = 1),
  n_disease = vec_count_matches(month, df$disease_date),
  n_total = iv_count_between(month, df$range),
  prevalence = n_disease / n_total
)
#> # A tibble: 13 × 4
#>    month        n_disease n_total prevalence
#>    <ymd<month>>     <int>   <int>      <dbl>
#>  1 2016-01              0       1       0   
#>  2 2016-02              0       2       0   
#>  3 2016-03              0       3       0   
#>  4 2016-04              0       5       0   
#>  5 2016-05              1       5       0.2 
#>  6 2016-06              0       4       0   
#>  7 2016-07              0       3       0   
#>  8 2016-08              2       4       0.5 
#>  9 2016-09              0       4       0   
#> 10 2016-10              1       4       0.25
#> 11 2016-11              0       3       0   
#> 12 2016-12              0       3       0   
#> 13 2017-01              0       0     NaN

Created on 2023-01-24 with reprex v2.0.2.9000

  • Related