Background
I've got a dataframe d
:
d <- data.frame(ID = c("a","a","a","a","a","a","a","b","b","c","c"),
treatment = c(0,1,0,0,0,1,0,1,0,1,1),
#event = c(0,0,1,1,1,1,1,0,1,1,1),
service_date = as.Date(c("2011-01-01",
"2011-08-21",
"2011-12-23",
"2012-02-23",
"2013-09-14",
"2013-04-07",
"2014-10-14",
"2013-01-01",
"2013-12-12",
"2014-06-17",
"2015-09-29")),
stringsAsFactors=FALSE)
> d
ID treatment service_date
1 a 0 2011-01-01
2 a 1 2011-08-21
3 a 0 2011-12-23
4 a 0 2012-02-23
5 a 0 2013-09-14
6 a 1 2013-04-07
7 a 0 2014-10-14
8 b 1 2013-01-01
9 b 0 2013-12-12
10 c 1 2014-06-17
11 c 1 2015-09-29
It describes some people (ID
), whether or not they had a treatment
, and the date of each entry (row).
The Problem
I want to calculate the mean duration between the first and last treatment==1
for ID
s who have more than 1 row where treatment==1
.
To make that more clear, let's lay out the steps as if we were doing this manually, and also see what answer I want:
Take
ID
a. Mr. A has 7 rows of data, but only two rows in whichtreatment==1
: one from 2011-08-21 (row 2) and another from 2013-09-14 (row 6). If you hand-calculate the difference, you find that there are 595 days between the two.For
ID
b, we do nothing, as they only have 1treatment==1
. (We'll usefilter
to skip people like b in the code.)For Mr. c, we get a difference of 469 days.
Average duration of treatment in this group: (595 days 469 days) / 2 people = 532 days. This is the desired result.
(It's entirely possible I've done this hand-calculation wrong, and that's fine, as long as it suffices to understand what I'm trying to do. Happy to clarify further if needed; let me know!)
What I've tried
I'm trying to adapt some old code from a similar query to work for this:
d %>%
group_by(ID) %>%
filter(sum(treatment) >1) %>%
mutate(treatment_years = lubridate::time_length(max(service_date) - min(service_date), unit = "year")) %>%
ungroup() %>%
summarise(avg = mean(treatment_years),
sd = sd(treatment_years))
This code runs, and gets me nearly there. It's filtering out the unwanted IDs
and making a mean (and SD) calculation for a defined interval of time for each person.
But it's not quite correct: in lubridate::time_length
, it isn't specifying the condition "max service date where treatment==1
" minus "min service date where treatment==1
". (The bolded parts are what's missing, and needed.)
How do I get it to do that?
I've tried something like this but it just throws an error:
d %>%
group_by(ID) %>%
filter(sum(treatment) >1) %>%
mutate(treatment_years = lubridate::time_length(max(service_date) & treatment==1 - min(service_date) & treatment==1, unit = "year")) %>%
ungroup() %>%
summarise(avg = mean(treatment_years),
sd = sd(treatment_years))
CodePudding user response:
We may subset the service_date
with a logical vector treatment == 1
i.e. service_date[treatment == 1]
(assuming there is at least one 'treatment' level 1)
library(dplyr)
library(lubridate)
d %>%
group_by(ID) %>%
filter(sum(treatment) >1) %>%
summarise(treatment_years = lubridate::time_length(max(service_date[treatment == 1]) -
min(service_date[treatment == 1]), unit = "day"), .groups = 'drop') %>%
summarise(avg = mean(treatment_years),
sd = sd(treatment_years))
-output
# A tibble: 1 × 2
avg sd
<dbl> <dbl>
1 532 89.1
CodePudding user response:
An option using by
and just subtracting the treated dates.
by(d, d$ID, \(x) {
if (all(x$treatment == 0)) NA_real_
else diff(x$service_date[x$treatment == 1]) |> as.numeric()
}) |> unlist() |> {\(x) c(mean=mean(x, na.rm=TRUE), sd=sd(x, na.rm=TRUE))}()
# mean sd
# 532.00000 89.09545