I have a large dataset with information on hospitalizations, date of admission and date of discharge.
I would like to find, for each hospitalization, the calendar month that contains the largest share of the hospitalization spell. In case the days between admission and discharge are equally distributed across months, I would like to associate it with the month of admission.
Below is a simple example:
hosp_dates <- data.table(id = c(1:5),
dt_admission = as.Date(c("2000-01-01", "2000-01-10", "2002-01-16", "2005-01-17", "2010-01-20")),
dt_discharge = as.Date(c("2000-01-20", "2000-02-02", "2002-02-16", "2005-02-16", "2010-03-31")))
hosp_dates %>%
mutate(month = c(1, 1, 1, 2, 3),
year = c(2000, 2000, 2002, 2005, 2010))
Output
id dt_admission dt_discharge month year
1: 1 2000-01-01 2000-01-20 1 2000
2: 2 2000-01-10 2000-02-02 1 2000
3: 3 2002-01-16 2002-02-16 1 2002
4: 4 2005-01-17 2005-02-16 2 2005
5: 5 2010-01-20 2010-03-31 3 2010
CodePudding user response:
À la brute-force (assuming distinct ids):
library(collapse)
hosp_dates[,
c('m', 'y') := {
sd = seq.Date(dt_admission, dt_discharge, 1L)
.(fmode(month(sd)), fmode(year(sd)))
},
by = id]
# id dt_admission dt_discharge m y
# <int> <Date> <Date> <int> <int>
# 1: 1 2000-01-01 2000-01-20 1 2000
# 2: 2 2000-01-10 2000-02-02 1 2000
# 3: 3 2002-01-16 2002-02-16 1 2002
# 4: 4 2005-01-17 2005-02-16 2 2005
# 5: 5 2010-01-20 2010-03-31 3 2010
If you have admissions lasting more than 11 months you'll get results different from what you are expecting and you might prefer something such as:
hosp_dates[,
ym := format(seq.Date(dt_admission, dt_discharge, 1L), "%Y-%m") |>
fmode(),
by = id]
# id dt_admission dt_discharge ym
# <int> <Date> <Date> <char>
# 1: 1 2000-01-01 2000-01-20 2000-01
# 2: 2 2000-01-10 2000-02-02 2000-01
# 3: 3 2002-01-16 2002-02-16 2002-01
# 4: 4 2005-01-17 2005-02-16 2005-02
# 5: 5 2010-01-20 2010-03-31 2010-03