In R, use Lubridate to get a conditional average duration between events-CodePudding

Background

I've got a dataframe d:

d <- data.frame(ID = c("a","a","a","a","a","a","a","b","b","c","c"),
                treatment = c(0,1,0,0,0,1,0,1,0,1,1),
                #event = c(0,0,1,1,1,1,1,0,1,1,1),
                service_date = as.Date(c("2011-01-01",   
                                         "2011-08-21",   
                                         "2011-12-23",   
                                         "2012-02-23",   
                                         "2013-09-14",   
                                         "2013-04-07",   
                                         "2014-10-14",   
                                         "2013-01-01",
                                         "2013-12-12",   
                                         "2014-06-17",
                                         "2015-09-29")), 
                stringsAsFactors=FALSE)

> d
   ID treatment service_date
1   a         0   2011-01-01
2   a         1   2011-08-21
3   a         0   2011-12-23
4   a         0   2012-02-23
5   a         0   2013-09-14
6   a         1   2013-04-07
7   a         0   2014-10-14
8   b         1   2013-01-01
9   b         0   2013-12-12
10  c         1   2014-06-17
11  c         1   2015-09-29

It describes some people (ID), whether or not they had a treatment, and the date of each entry (row).

The Problem

I want to calculate the mean duration between the first and last treatment==1 for IDs who have more than 1 row where treatment==1.

To make that more clear, let's lay out the steps as if we were doing this manually, and also see what answer I want:

Take ID a. Mr. A has 7 rows of data, but only two rows in which treatment==1: one from 2011-08-21 (row 2) and another from 2013-09-14 (row 6). If you hand-calculate the difference, you find that there are 595 days between the two.
For ID b, we do nothing, as they only have 1 treatment==1. (We'll use filter to skip people like b in the code.)
For Mr. c, we get a difference of 469 days.
Average duration of treatment in this group: (595 days 469 days) / 2 people = 532 days. This is the desired result.

(It's entirely possible I've done this hand-calculation wrong, and that's fine, as long as it suffices to understand what I'm trying to do. Happy to clarify further if needed; let me know!)

What I've tried

I'm trying to adapt some old code from a similar query to work for this:

d %>%
  group_by(ID) %>%
  filter(sum(treatment) >1) %>%
  mutate(treatment_years = lubridate::time_length(max(service_date) - min(service_date), unit = "year")) %>%
  ungroup() %>%
  summarise(avg = mean(treatment_years),
            sd = sd(treatment_years))

This code runs, and gets me nearly there. It's filtering out the unwanted IDs and making a mean (and SD) calculation for a defined interval of time for each person.

But it's not quite correct: in lubridate::time_length, it isn't specifying the condition "max service date where treatment==1" minus "min service date where treatment==1". (The bolded parts are what's missing, and needed.)

How do I get it to do that?

I've tried something like this but it just throws an error:

d %>%
  group_by(ID) %>%
  filter(sum(treatment) >1) %>%
  mutate(treatment_years = lubridate::time_length(max(service_date) & treatment==1 - min(service_date) & treatment==1, unit = "year")) %>%
  ungroup() %>%
  summarise(avg = mean(treatment_years),
            sd = sd(treatment_years))

CodePudding user response：

We may subset the service_date with a logical vector treatment == 1 i.e. service_date[treatment == 1] (assuming there is at least one 'treatment' level 1)

library(dplyr)
library(lubridate)
d %>%
  group_by(ID) %>%
  filter(sum(treatment) >1) %>%
  summarise(treatment_years = lubridate::time_length(max(service_date[treatment == 1]) - 
                    min(service_date[treatment == 1]), unit = "day"), .groups = 'drop') %>%  
  summarise(avg = mean(treatment_years),
            sd = sd(treatment_years))

-output

# A tibble: 1 × 2
    avg    sd
  <dbl> <dbl>
1   532  89.1

CodePudding user response：

An option using by and just subtracting the treated dates.

by(d, d$ID, \(x) {
  if (all(x$treatment == 0)) NA_real_
  else diff(x$service_date[x$treatment == 1]) |> as.numeric()
}) |> unlist() |> {\(x) c(mean=mean(x, na.rm=TRUE), sd=sd(x, na.rm=TRUE))}()
#      mean        sd 
# 532.00000  89.09545