My dataset looks something like this
dat <- data.frame(id=c(100,100,100,101,101,101,102,102,102,103,103,103),
visit = c(1,2,3,1,2,3,1,2,3,1,2,3),
visit.date = c(9/15/2020,11/29/2020,12/23/2020,9/7/2020,11/16/2020,12/9/2020,9/16/2020,12/6/2020,1/6/2021,10/4/2020,11/30/2020,12/23/2020),
delivery.date = c(NA,NA,NA,NA,11/2/2020,NA,NA,11/21/2020,NA,NA,11/15/2020,NA),
death = c(0,1,NA,0,0,0,0,0,1,0,0,1))
Essentially, I have three different visits for each participant with a unique ID. What I need to do is create a variable that states the date of death for each participant who reported a death (death=1). The date of death should be the midpoint between the visit date when the baby was reported dead and the last visit date that the baby was reported alive. Then, if the midpoint date is before the delivery date, I need that newly created variable column to list the delivery date as the day of death.
I've tried creating my own function and applying it using lapply as below, but I end up getting a separate report for each row that lists the participant ID, and the value is null. Here is the code I've tried. Ultimately, I will need to calculate person time at risk in days which is why I was trying to use difftime here. The dates are in POSIXct format as well to be compatible with the difftime function. Any help here would be very appreciated!
risktime <- function(id,dat) {
a<- difftime(dat$visit.date[max(dat$visit)],dat$delivery.date,units="days")[dat["id"]=="id"]
a}
risktime1 <- lapply(unique(dat$id),risktime,dat)
riktime1
CodePudding user response:
I'm not sure why you use difftime
, IMO you should use mean.Date
instead which already gives the midpoint between two dates. Also works with "Date"
class, that we create first.
dat <- transform(dat, visit.date=as.Date(visit.date, '%m/%d/%Y'),
delivery.date=as.Date(delivery.date, '%m/%d/%Y'))
Simply wrap mean.Date
in a function. There is some case handling involved, whether there's 1. a death, 2.a delivery date which is earlier than reported death, 2.b if the latter is later.
f <- \(x) {
if (any(x$death == 1)) {
last_alive <- with(x, which.max(cumsum(death == 0)))
first_dead <- with(x, which.max(cumsum(death == 1)))
u <- mean(c(x$visit.date[last_alive], x$visit.date[first_dead]))
dd <- x$delivery.date[!is.na(x$delivery.date)]
if (!length(dd) == 0) {
if (u < dd) {
x$est_death <- dd
} else {
x$est_death <- u
}
} else {
x$est_death <- u
}
} else {
x$est_death <- as.Date(NA_integer_)
}
return(x)
}
Finally use function in by
.
by(dat, dat$id, f) |> do.call(what=rbind)
# id visit visit.date delivery.date death est_death
# 100.1 100 1 2020-09-15 <NA> 0 2020-10-22
# 100.2 100 2 2020-11-29 <NA> 1 2020-10-22
# 100.3 100 3 2020-12-23 <NA> NA 2020-10-22
# 101.4 101 1 2020-09-07 <NA> 0 <NA>
# 101.5 101 2 2020-11-16 2020-11-02 0 <NA>
# 101.6 101 3 2020-12-09 <NA> 0 <NA>
# 102.7 102 1 2020-09-16 <NA> 0 2020-12-21
# 102.8 102 2 2020-12-06 2020-11-21 0 2020-12-21
# 102.9 102 3 2021-01-06 <NA> 1 2020-12-21
# 103.10 103 1 2020-10-04 <NA> 0 2020-12-11
# 103.11 103 2 2020-11-30 2020-11-15 0 2020-12-11
# 103.12 103 3 2020-12-23 <NA> 1 2020-12-11
Data:
dat <- structure(list(id = c(100, 100, 100, 101, 101, 101, 102, 102,
102, 103, 103, 103), visit = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1,
2, 3), visit.date = structure(c(18520, 18595, 18619, 18512, 18582,
18605, 18521, 18602, 18633, 18539, 18596, 18619), class = "Date"),
delivery.date = structure(c(NA, NA, NA, NA, 18568, NA, NA,
18587, NA, NA, 18581, NA), class = "Date"), death = c(0,
1, NA, 0, 0, 0, 0, 0, 1, 0, 0, 1)), class = "data.frame", row.names = c(NA,
-12L))
CodePudding user response:
I personally find dplyr
to be more expressive with these kinds of grouped
operations than base R. Here’s how I would write this (using @jay.sf’s data):
library(dplyr, warn.conflicts = FALSE)
dat %>%
group_by(id) %>%
# Create a cumulative indicator for survival status
mutate(
dead = cummax(replace(death, is.na(death), 0))
) %>%
# Estimate date of death
summarise(
delivery.date = first(na.omit(delivery.date)),
last.alive = last(visit.date[dead == 0]),
first.dead = first(visit.date[dead == 1]),
death.date = mean(c(last.alive, first.dead))
) %>%
# Ensure estimated death date is not before delivery
mutate(
death.date = replace(death.date, which(delivery.date > death.date), delivery.date)
)
#> # A tibble: 4 x 5
#> id delivery.date last.alive first.dead death.date
#> <dbl> <date> <date> <date> <date>
#> 1 100 NA 2020-09-15 2020-11-29 2020-10-22
#> 2 101 2020-11-02 2020-12-09 NA NA
#> 3 102 2020-11-21 2020-12-06 2021-01-06 2020-12-21
#> 4 103 2020-11-15 2020-11-30 2020-12-23 2020-12-11