Creating R loop/apply function to iterate by unique participant ID and calculate midpoint between da-CodePudding

My dataset looks something like this

dat <- data.frame(id=c(100,100,100,101,101,101,102,102,102,103,103,103),
visit = c(1,2,3,1,2,3,1,2,3,1,2,3),
visit.date = c(9/15/2020,11/29/2020,12/23/2020,9/7/2020,11/16/2020,12/9/2020,9/16/2020,12/6/2020,1/6/2021,10/4/2020,11/30/2020,12/23/2020),
delivery.date = c(NA,NA,NA,NA,11/2/2020,NA,NA,11/21/2020,NA,NA,11/15/2020,NA),
death = c(0,1,NA,0,0,0,0,0,1,0,0,1))

Essentially, I have three different visits for each participant with a unique ID. What I need to do is create a variable that states the date of death for each participant who reported a death (death=1). The date of death should be the midpoint between the visit date when the baby was reported dead and the last visit date that the baby was reported alive. Then, if the midpoint date is before the delivery date, I need that newly created variable column to list the delivery date as the day of death.

I've tried creating my own function and applying it using lapply as below, but I end up getting a separate report for each row that lists the participant ID, and the value is null. Here is the code I've tried. Ultimately, I will need to calculate person time at risk in days which is why I was trying to use difftime here. The dates are in POSIXct format as well to be compatible with the difftime function. Any help here would be very appreciated!

risktime <- function(id,dat) {
     a<- difftime(dat$visit.date[max(dat$visit)],dat$delivery.date,units="days")[dat["id"]=="id"]
a}
risktime1 <- lapply(unique(dat$id),risktime,dat)
riktime1

CodePudding user response：

I'm not sure why you use difftime, IMO you should use mean.Date instead which already gives the midpoint between two dates. Also works with "Date" class, that we create first.

dat <- transform(dat, visit.date=as.Date(visit.date, '%m/%d/%Y'),
                 delivery.date=as.Date(delivery.date, '%m/%d/%Y'))

Simply wrap mean.Date in a function. There is some case handling involved, whether there's 1. a death, 2.a delivery date which is earlier than reported death, 2.b if the latter is later.

f <- \(x) {
  if (any(x$death == 1)) {
    last_alive <- with(x, which.max(cumsum(death == 0)))
    first_dead <- with(x, which.max(cumsum(death == 1)))
    u <- mean(c(x$visit.date[last_alive], x$visit.date[first_dead]))
    dd <- x$delivery.date[!is.na(x$delivery.date)]
    if (!length(dd) == 0) {
      if (u < dd) {
        x$est_death <- dd
      } else {
        x$est_death <- u
      }
    } else {
      x$est_death <- u
    }
  } else {
    x$est_death <- as.Date(NA_integer_)
  }
  return(x)
}

Finally use function in by.

by(dat, dat$id, f) |> do.call(what=rbind)
#         id visit visit.date delivery.date death  est_death
# 100.1  100     1 2020-09-15          <NA>     0 2020-10-22
# 100.2  100     2 2020-11-29          <NA>     1 2020-10-22
# 100.3  100     3 2020-12-23          <NA>    NA 2020-10-22
# 101.4  101     1 2020-09-07          <NA>     0       <NA>
# 101.5  101     2 2020-11-16    2020-11-02     0       <NA>
# 101.6  101     3 2020-12-09          <NA>     0       <NA>
# 102.7  102     1 2020-09-16          <NA>     0 2020-12-21
# 102.8  102     2 2020-12-06    2020-11-21     0 2020-12-21
# 102.9  102     3 2021-01-06          <NA>     1 2020-12-21
# 103.10 103     1 2020-10-04          <NA>     0 2020-12-11
# 103.11 103     2 2020-11-30    2020-11-15     0 2020-12-11
# 103.12 103     3 2020-12-23          <NA>     1 2020-12-11

Data:

dat <- structure(list(id = c(100, 100, 100, 101, 101, 101, 102, 102, 
102, 103, 103, 103), visit = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 
2, 3), visit.date = structure(c(18520, 18595, 18619, 18512, 18582, 
18605, 18521, 18602, 18633, 18539, 18596, 18619), class = "Date"), 
    delivery.date = structure(c(NA, NA, NA, NA, 18568, NA, NA, 
    18587, NA, NA, 18581, NA), class = "Date"), death = c(0, 
    1, NA, 0, 0, 0, 0, 0, 1, 0, 0, 1)), class = "data.frame", row.names = c(NA, 
-12L))

CodePudding user response：

I personally find dplyr to be more expressive with these kinds of grouped operations than base R. Here’s how I would write this (using @jay.sf’s data):

library(dplyr, warn.conflicts = FALSE)

dat %>% 
  group_by(id) %>% 
  # Create a cumulative indicator for survival status
  mutate(
    dead = cummax(replace(death, is.na(death), 0))
  ) %>% 
  # Estimate date of death
  summarise(
    delivery.date = first(na.omit(delivery.date)),
    last.alive = last(visit.date[dead == 0]),
    first.dead = first(visit.date[dead == 1]),
    death.date = mean(c(last.alive, first.dead))
  ) %>% 
  # Ensure estimated death date is not before delivery
  mutate(
    death.date = replace(death.date, which(delivery.date > death.date), delivery.date)
  )
#> # A tibble: 4 x 5
#>      id delivery.date last.alive first.dead death.date
#>   <dbl> <date>        <date>     <date>     <date>    
#> 1   100 NA            2020-09-15 2020-11-29 2020-10-22
#> 2   101 2020-11-02    2020-12-09 NA         NA        
#> 3   102 2020-11-21    2020-12-06 2021-01-06 2020-12-21
#> 4   103 2020-11-15    2020-11-30 2020-12-23 2020-12-11