R (data.table) Find calendar month with largest share of days in interval-CodePudding

I have a large dataset with information on hospitalizations, date of admission and date of discharge.

I would like to find, for each hospitalization, the calendar month that contains the largest share of the hospitalization spell. In case the days between admission and discharge are equally distributed across months, I would like to associate it with the month of admission.

Below is a simple example:

hosp_dates <- data.table(id = c(1:5), 
                         dt_admission = as.Date(c("2000-01-01", "2000-01-10", "2002-01-16", "2005-01-17", "2010-01-20")),
                         dt_discharge =  as.Date(c("2000-01-20", "2000-02-02", "2002-02-16", "2005-02-16", "2010-03-31")))

hosp_dates %>%
  mutate(month = c(1, 1, 1, 2, 3), 
         year = c(2000, 2000, 2002, 2005, 2010))

Output

   id dt_admission dt_discharge month year
1:  1   2000-01-01   2000-01-20     1 2000
2:  2   2000-01-10   2000-02-02     1 2000
3:  3   2002-01-16   2002-02-16     1 2002
4:  4   2005-01-17   2005-02-16     2 2005
5:  5   2010-01-20   2010-03-31     3 2010

CodePudding user response：

À la brute-force (assuming distinct ids):

library(collapse)
hosp_dates[, 
           c('m', 'y') := {
             sd = seq.Date(dt_admission, dt_discharge, 1L)
             .(fmode(month(sd)), fmode(year(sd)))
           }, 
           by = id]

#       id dt_admission dt_discharge     m     y
#    <int>       <Date>       <Date> <int> <int>
# 1:     1   2000-01-01   2000-01-20     1  2000
# 2:     2   2000-01-10   2000-02-02     1  2000
# 3:     3   2002-01-16   2002-02-16     1  2002
# 4:     4   2005-01-17   2005-02-16     2  2005
# 5:     5   2010-01-20   2010-03-31     3  2010

If you have admissions lasting more than 11 months you'll get results different from what you are expecting and you might prefer something such as:

hosp_dates[, 
           ym := format(seq.Date(dt_admission, dt_discharge, 1L), "%Y-%m") |>
             fmode(), 
           by = id]

#       id dt_admission dt_discharge      ym
#    <int>       <Date>       <Date>  <char>
# 1:     1   2000-01-01   2000-01-20 2000-01
# 2:     2   2000-01-10   2000-02-02 2000-01
# 3:     3   2002-01-16   2002-02-16 2002-01
# 4:     4   2005-01-17   2005-02-16 2005-02
# 5:     5   2010-01-20   2010-03-31 2010-03